
Representation Learning for Visual-Relational Knowledge Graphs

Daniel Oñoro-Rubio, NEC Labs Europe, Heidelberg, Germany
Mathias Niepert, NEC Labs Europe, Heidelberg, Germany
Alberto García-Durán, NEC Labs Europe, Heidelberg, Germany
Roberto González-Sánchez, NEC Labs Europe, Heidelberg, Germany
Roberto J. López-Sastre, University of Alcalá, Alcalá de Henares, Spain

ABSTRACT

A visual-relational knowledge graph (KG) is a multi-relational graph whose entities are associated with images. We introduce ImageGraph¹, a KG with 1,330 relation types, 14,870 entities, and 829,931 images. Visual-relational KGs lead to novel probabilistic query types where images are treated as first-class citizens. Both the prediction of relations between unseen images and multi-relational image retrieval can be formulated as query types in a visual-relational KG. We approach the problem of answering such queries with a novel combination of deep convolutional networks and models for learning knowledge graph embeddings. The resulting models can answer queries such as "How are these two unseen images related to each other?" We also explore a zero-shot learning scenario where an image of an entirely new entity is linked with multiple relations to entities of an existing KG. The multi-relational grounding of unseen entity images into a knowledge graph serves as the description of such an entity. We conduct experiments to demonstrate that the proposed deep architectures in combination with KG embedding objectives can answer the visual-relational queries efficiently and accurately.

ACM Reference Format:

Daniel Oñoro-Rubio, Mathias Niepert, Alberto García-Durán, Roberto González-Sánchez, and Roberto J. López-Sastre. 2018. Representation Learning for Visual-Relational Knowledge Graphs. In Proceedings of Deep Learning Day (KDD'2018 Workshop). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Several application domains can be modeled with knowledge graphs where entities are represented by nodes, object attributes by node attributes, and relationships between entities by directed edges between the nodes. For instance, a product recommendation system can be represented as a knowledge graph where nodes represent customers and products and where typed edges represent customer reviews and purchasing events. In the medical domain, there are several knowledge graphs that model diseases, symptoms, drugs, genes, and their interactions (cf. [2, 36, 51]).

¹ Project URL: https://github.com/nle-ml/mmkb.git.


Figure 1: A visual-relational knowledge graph. Entities (Tokyo, Japan, Sensō-ji, Gotoh Museum, Murasaki Shikibu) are associated with images and connected by relations such as capitalOf, locatedIn, hasArtAbout, and bornIn.

Increasingly, entities in these knowledge graphs are associated with visual data. For instance, in the online retail domain, there are product and advertising images, and in the medical domain, there are patient-associated imaging data sets (MRIs, CTs, and so on).

The ability of knowledge graphs to compactly represent a domain, its attributes, and relations makes them an important component of numerous AI systems. KGs facilitate the integration, organization, and retrieval of structured data and support various forms of reasoning. In recent years, KGs have been playing an increasingly crucial role in fields such as question answering [6, 9], language modeling [1], and text generation [40]. Even though there is a large body of work on learning and reasoning in KGs, the setting of visual-relational KGs, where entities are associated with visual data, has not received much attention. A visual-relational KG represents entities, relations between these entities, and a large number of images associated with the entities (see Figure 1 for an example). While ImageNet [10] and the VisualGenome [23] datasets are based on KGs such as WordNet, they are predominantly used either as an object classification data set, as in the case of ImageNet, or to facilitate scene understanding in a single image. With ImageGraph, we propose the problem of reasoning about visual concepts across a large set of images organized in a knowledge graph.


Table 1: Statistics of the knowledge graphs used in this paper.

                    Entities |E|  Relations |R|  Triples (Train / Valid / Test)   Images (Train / Valid / Test)
ImageNet [10]       21,841        18             -                                14,197,122
VisualGenome [23]   75,729        40,480         1,531,448                        108,077
FB15k [7]           14,951        1,345          483,142 / 50,000 / 59,071        0 / 0 / 0
ImageGraph          14,870        1,330          460,406 / 47,533 / 56,071        411,306 / 201,832 / 216,793

Figure 2: Image samples for some entities of ImageGraph (Japan, Football, Michael Jackson, Madrid, The Simpson, Drummer).

Figure 3: Set of query types.

The core idea is to treat images as first-class citizens both in the KG and in relational KG completion queries. In combination with the multi-relational structure of a KG, numerous more complex queries are possible. The main objective of our work is to understand to what extent visual data associated with entities of a KG can be used in conjunction with deep learning methods to answer visual-relational queries. Allowing images to be arguments of queries facilitates numerous novel query types. In Figure 3 we list some of the query types we address in this paper. In order to answer these queries, we build both on KG embedding methods and on deep representation learning approaches for visual data. There has been a flurry of machine learning approaches tailored to specific problems such as link prediction in knowledge graphs. Examples are knowledge base factorization and embedding approaches [7, 18, 30, 32] and random-walk based ML models [14, 24]. We combine these approaches with deep neural networks to facilitate visual-relational query answering.

There are numerous application domains that could benefit from query answering in visual KGs. For instance, in online retail, visual representations of novel products could be leveraged for zero-shot product recommendations. Crucially, instead of only being able to retrieve similar products, a visual-relational KG would support the prediction of product attributes and, more specifically, what attributes customers might be interested in. For instance, in the fashion industry visual attributes are crucial for product recommendations [26, 41, 48, 49]. In general, we believe that being able to ground novel visual concepts into an existing KG with attributes and various relation types is a reasonable approach to zero-shot learning.

We make the following contributions. First, we introduce ImageGraph, a visual-relational KG with 1,330 relations where 829,931 images are associated with 14,870 different entities. Second, we introduce a new set of visual-relational query types. Third, we propose a novel set of neural architectures and objectives that we use for answering these novel query types. This is the first time that deep CNNs and KG embedding learning objectives are combined into a joint model. Fourth, we show that the proposed class of deep neural networks is also successful for zero-shot learning, that is, creating relations between entirely unseen entities and the KG using only visual data at query time.

2 RELATED WORK

We discuss the relation of our contributions to previous work with an emphasis on object detection, scene understanding, existing data sets, and zero-shot learning.

Relational and Visual Data. Answering queries in a visual-relational knowledge graph is our main objective. Previous work on combining relational and visual data has focused on object detection [13, 15, 25, 29, 37] and scene recognition [11, 34, 38, 45, 52], which are required for more complex visual-relational reasoning. Recent years have witnessed a surge in reasoning about human-object, object-object, and object-attribute relationships [8, 12, 13, 17, 19, 28, 54, 56].


Figure 4: 4a plots the distribution of relation type frequencies on a log-log scale (y-axis: triple count; x-axis: relation type index; labeled examples range from frequent types such as award_nominee and profession to rare ones such as ingredient). 4b shows the proportions of symmetric (4%), asymmetric (88%), and other (8%) relation types.

Table 2: Relation samples collected for symmetric, asymmetric, and other relation types.

Relation type   Example (h, r, t)
Symmetric       (Emma Thompson, sibling, Sophie Thompson)
                (Sophie Thompson, sibling, Emma Thompson)
Asymmetric      (Non-profit organization, company_type, Apache Software Foundation)
                (Statistics, students_majoring, PhD)
Others          (Star Wars, film_series, Star Wars)
                (Star Wars Episode I: The Phantom Menace, film_series, Star Wars)
                (Star Wars Episode II: Attack of the Clones, film_series, Star Wars)

The VisualGenome project [23] is a knowledge base that integrates language and vision modalities. The project provides a knowledge graph, based on WordNet, which provides annotations of categories, attributes, and relation types for each image. Recent work has used the dataset to focus on scene understanding in single images. For instance, Lu et al. [27] proposed a model to detect relation types between objects depicted in an image by inferring sentences such as "man riding bicycle." Veit et al. [49] propose a siamese CNN to learn a metric representation on pairs of textile products so as to learn which products have similar styles. There is a large body of work on metric learning where the objective is to generate image embeddings such that a pairwise distance-based loss is minimized [4, 33, 39, 43, 50]. Recent work has extended this idea to directly optimize a clustering quality metric [44]. Zhou et al. propose a method based on a bipartite graph that links depictions of meals to their ingredients. Johnson et al. [21] propose to use the VisualGenome data to recover images from text queries. ImageGraph is different from these data sets in that the relation types hold between different images and image-annotated entities. This defines a novel class of problems where one seeks to answer queries such as "How are these two images related?" With this work, we address problems ranging from predicting the relation types for image pairs to multi-relational image retrieval.

Zero-shot Learning. We focus on exploring ways in which KGs can be used to find relationships between unseen images, that is, images depicting novel entities that are not part of the KG, and visual depictions of known KG entities. This is a form of zero-shot learning (ZSL) where the objective is to generalize to novel visual concepts without seeing any training examples. Generally, ZSL methods (e.g. [35, 55]) rely on an underlying embedding space, such as attributes, in order to recognize the unseen categories. However, in this paper, we do not assume the availability of such a common embedding space; instead, we assume the existence of an external visual-relational KG. Similar to our approach, when this explicit knowledge is not encoded in the underlying embedding space, other works rely on finding the similarities through the linguistic space (e.g. [3, 27]), leveraging distributional word representations so as to capture a notion of taxonomy and similarity. But these works address scene understanding in a single image, i.e., these models are able to detect the visual relationships in one given image. In contrast, our models are able to find relationships between different images and entities.

3 IMAGEGRAPH: A VISUAL-RELATIONAL KNOWLEDGE GRAPH

ImageGraph is a visual-relational KG whose relational structure is based on that of Freebase [5]. More specifically, it is based on FB15k, a subset of Freebase, which has been used as a benchmark data set [30]. Since FB15k does not include visual data, we perform the following steps to enrich the KG entities with image data. We implemented a web crawler that is able to parse query results for the image search engines Google Images, Bing Images, and Yahoo Image Search. To minimize the amount of noise due to polysemous entity labels (for example, there are more than 100 Freebase entities with the text label "Springfield"), we extracted, for each entity in FB15k, all Wikipedia URIs from the 1.9 billion triple Freebase RDF dump. For instance, for Springfield, Massachusetts, we obtained such URIs as Springfield_(Massachusetts,United_States) and Springfield_(MA). These URIs were processed and used as search queries for disambiguation purposes. We used the crawler to download more than 2.4M images (more than 462 GB of data). We removed corrupted, low-quality, and duplicate images, and we used the 25 top images returned by each of the image search engines whenever there were more than 25 results. The images were scaled to have a maximum height or width of 500 pixels while maintaining their aspect ratio. This resulted in 829,931 images associated with 14,870 different entities (55.8 images per entity). After filtering out triples where either the head or tail entity could not be associated with an image, the visual KG consists of 564,010 triples expressing 1,330 different relation types between 14,870 entities. We provide three sets of triples for training, validation, and testing plus three image splits, also for training, validation, and testing. Table 1 lists the statistics of the resulting visual KG. Any KG derived from FB15k, such as FB15k-237 [46], can also be associated with the crawled images. Since providing the images themselves would violate copyright law, we provide the code for the distributed crawler and the list of image URLs crawled for the experiments in this paper².
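The deduplication and rescaling steps can be illustrated with a short sketch. The snippet below is an illustrative reimplementation, not the released crawler: it assumes the crawled files sit in a flat directory (a hypothetical layout), drops exact duplicates via a content hash, skips corrupted files, and shrinks each image so that neither side exceeds 500 pixels while preserving the aspect ratio.

```python
# Sketch of the image clean-up described above (hypothetical directory layout).
import hashlib
from pathlib import Path
from PIL import Image

def preprocess_images(src_dir: str, dst_dir: str, max_side: int = 500) -> int:
    """Drop duplicate/corrupted files, rescale the rest; return number kept."""
    seen_hashes, kept = set(), 0
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*"):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen_hashes:              # exact duplicate download
            continue
        seen_hashes.add(digest)
        try:
            Image.open(path).verify()          # cheap corruption check
            img = Image.open(path)             # re-open after verify()
        except Exception:
            continue
        img.thumbnail((max_side, max_side))    # in place, keeps aspect ratio
        img.convert("RGB").save(Path(dst_dir) / (path.stem + ".jpg"))
        kept += 1
    return kept
```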

The distribution of relation types is depicted in Figure 4a. We show, in logarithmic scale, the number of times each relation occurs in the KG. We observe that relationships like award_nominee or profession occur quite frequently while others such as ingredient occur just a few times. 4% of the relation types are symmetric, 88% are asymmetric, and 8% are others (see Figure 4b). A symmetric relation implies (h, r, t) ⇒ (t, r, h), an asymmetric relation satisfies (h, r, t) ⇒ (t, ¬r, h), and under others we group those relations that are neither symmetric nor asymmetric. Table 2 shows some qualitative samples of the relation types. There are 585 distinct entity types such as Person, Athlete, and City. In Figure 5a we plot the most frequent entity types. In Figure 5b we plot the entity frequencies and some example entities.

² ImageGraph crawler and URLs: https://github.com/robegs/imageDownloader.

Figure 5: 5a depicts the 10 most frequent entity types. 5b plots the entity distribution with some example entities (e.g., United States of America, English Language, Executive Producer). The y-axis represents the total number of occurrences and the x-axis the entity indices.

Table 1 lists the statistics of ImageGraph and related data sets. First, we would like to remark on the differences between ImageGraph and the Visual Genome data (VGD) [23]. With ImageGraph we address the problem of learning a representation that directly maps the relations between images, without the involvement of text. On a high level, we focus on performing rankings as responses to probabilistic queries. In some sense, this is similar to information retrieval except that in our proposed work, images are first-class citizens. In contrast, VGD is focused on modeling relations between objects in images, and the relations are expressed in natural language. Second, the main differences between ImageGraph and ImageNet are the following. ImageNet is based on WordNet, a lexical database where synonymous words from the same lexical category are grouped into synsets. There are 18 relations expressing connections between synsets. In Freebase, on the other hand, there are two orders of magnitude more relations. In FB15k, the subset we focus on, there are 1,345 relations expressing, for example, the location of places, the positions of basketball players, and the gender of entities. Moreover, entities in ImageNet exclusively represent entity types such as Cats and Cars, whereas entities in FB15k are either entity types or instances of entity types such as Albert Einstein and Paris. This renders the computer vision problems associated with ImageGraph more challenging than those for existing datasets. Moreover, with ImageGraph the focus is on learning relational ML models that incorporate visual data both during learning and at query time.

4 REPRESENTATION LEARNING FOR VISUAL-RELATIONAL GRAPHS

A knowledge graph (KG) K is given by a set of triples T, that is, statements of the form (h, r, t), where h, t ∈ E are the head and tail entities, respectively, and r ∈ R is a relation type. Figure 1 depicts a small fragment of a KG with relations between entities and images associated with the entities. Prior work has not included image data and has, therefore, focused on the following two types of queries.


First, the query type (h, r?, t) asks for the relations between a given pair of head and tail entities. Second, the query types (h, r, t?) and (h?, r, t) ask for entities correctly completing the triple. The latter query type is often referred to as knowledge base completion. Here, we focus on queries that involve visual data as query objects, that is, objects that are either contained in the queries, the answers to the queries, or both.

4.1 Visual-Relational Query Answering

When entities are associated with image data, several completely novel query types are possible. Figure 3 lists the query types we focus on in this paper. We refer to images used during training as seen and to all other images as unseen.

(1) Given a pair of unseen images for which we do not know their KG entities, determine the unknown relations between the underlying entities.

(2) Given an unseen image, for which we do not know the underlying KG entity, and a relation type, determine the seen images that complete the query.

(3) Given an unseen image of an entirely new entity that is not part of the KG, and an unseen image for which we do not know the underlying KG entity, determine the unknown relations between the two underlying entities.

(4) Given an unseen image of an entirely new entity that is not part of the KG, and a known KG entity, determine the unknown relations between the two entities.

For each of these query types, the sought-after relations between the underlying entities have never been observed during training. Query types (3) and (4) are a form of zero-shot learning since neither the new entity's relationships with other entities nor its images have been observed during training. These considerations illustrate the novel nature of the visual query types. The machine learning models have to be able to learn the relational semantics of the KG and not simply a classifier that assigns images to entities. These query types are also motivated by the fact that for typical KGs the number of entities is orders of magnitude greater than the number of relations.

4.2 Deep Representation Learning for Query Answering

We first discuss the state of the art of KG embedding methods and translate the concepts to query answering in visual-relational KGs. Let raw_i be the raw feature representation for entity i ∈ E and let f and g be differentiable functions. Most KG completion methods learn an embedding of the entities in a vector space via some scoring function that is trained to assign high scores to correct triples and low scores to incorrect triples. Scoring functions often have the form f_r(e_h, e_t), where r is a relation, e_h and e_t are d-dimensional vectors (the embeddings of the head and tail entities, respectively), and where e_i = g(raw_i) is an embedding function that maps the raw input representation of entities to the embedding space. In the case of KGs without additional visual data, the raw representation of an entity is simply its one-hot encoding.

Existing KG completion methods use the embedding function g(raw_i) = raw_i^⊤ W, where W is a |E| × d matrix, and differ only in their scoring function, that is, in the way that the embedding vectors of the head and tail entities are combined with the parameter vector ϕ_r (a short code sketch of these composition functions follows the list):

• Difference (TransE [7]): f_r(e_h, e_t) = −||e_h + ϕ_r − e_t||_2, where ϕ_r is a d-dimensional vector;
• Multiplication (DistMult [53]): f_r(e_h, e_t) = (e_h ∗ e_t) · ϕ_r, where ∗ is the element-wise product and ϕ_r is a d-dimensional vector;
• Circular correlation (HolE [31]): f_r(e_h, e_t) = (e_h ⋆ e_t) · ϕ_r, where [a ⋆ b]_k = \sum_{i=0}^{d-1} a_i b_{(i+k) \bmod d} and ϕ_r is a d-dimensional vector; and
• Concatenation: f_r(e_h, e_t) = (e_h ⊙ e_t) · ϕ_r, where ⊙ is the concatenation operator and ϕ_r is a 2d-dimensional vector.
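As a concrete illustration, the following NumPy sketch implements the four composition functions for given d-dimensional embeddings e_h, e_t and a relation vector ϕ_r; the function names and the toy dimensionality are illustrative choices, not part of the paper.

```python
import numpy as np

def score_diff(e_h, e_t, phi_r):    # Difference (TransE-style)
    return -np.linalg.norm(e_h + phi_r - e_t, ord=2)

def score_mult(e_h, e_t, phi_r):    # Multiplication (DistMult-style)
    return np.dot(e_h * e_t, phi_r)

def score_ccorr(e_h, e_t, phi_r):   # Circular correlation (HolE-style), via FFT
    corr = np.fft.ifft(np.conj(np.fft.fft(e_h)) * np.fft.fft(e_t)).real
    return np.dot(corr, phi_r)

def score_cat(e_h, e_t, phi_r):     # Concatenation; phi_r has 2d entries
    return np.dot(np.concatenate([e_h, e_t]), phi_r)

# Toy example with d = 4.
rng = np.random.default_rng(0)
e_h, e_t = rng.normal(size=4), rng.normal(size=4)
print(score_diff(e_h, e_t, rng.normal(size=4)),
      score_cat(e_h, e_t, rng.normal(size=8)))
```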

For each of these instances, the matrix W (storing the entity embeddings) and the vectors ϕ_r are learned during training. In general, the parameters are trained such that f_r(e_h, e_t) is high for true triples and low for triples assumed not to hold in the KG. The training objective is often based on the logistic loss, which has been shown to be superior for most of the composition functions [47]:

\min_{\Theta} \sum_{(h,r,t) \in T_{\mathrm{pos}}} \log\left(1 + \exp(-f_r(e_h, e_t))\right) + \sum_{(h,r,t) \in T_{\mathrm{neg}}} \log\left(1 + \exp(f_r(e_h, e_t))\right) + \lambda \|\Theta\|_2^2,   (1)

where T_pos and T_neg are the sets of positive and negative training triples, respectively, Θ are the parameters trained during learning, and λ is a regularization hyperparameter. For the above objective, a process for creating corrupted triples T_neg is required. This often involves sampling a random entity for either the head or the tail entity. To answer queries of the types (h, r, t?) and (h?, r, t) after training, we form all possible completions of the queries and compute a ranking based on the scores assigned by the trained model to these completions.
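A minimal sketch of this training objective, assuming the embeddings are stored in dictionaries of NumPy vectors and `score` is one of the composition functions above (the corruption scheme and hyperparameter values are illustrative, not the paper's exact configuration):

```python
import math
import random

def corrupt(triple, entities):
    """Create a negative triple by replacing the head or the tail at random."""
    h, r, t = triple
    if random.random() < 0.5:
        return (random.choice(entities), r, t)
    return (h, r, random.choice(entities))

def logistic_loss(pos_triples, entities, emb, phi, score, lam=1e-4):
    """Eq. (1): pull positive triples up, push sampled negatives down."""
    loss = 0.0
    for (h, r, t) in pos_triples:
        loss += math.log1p(math.exp(-score(emb[h], emb[t], phi[r])))
        hn, rn, tn = corrupt((h, r, t), entities)
        loss += math.log1p(math.exp(score(emb[hn], emb[tn], phi[rn])))
    reg = sum(float((v * v).sum()) for v in list(emb.values()) + list(phi.values()))
    return loss + lam * reg
```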

For queries of type (h, r?, t) one typically uses the softmax activation in conjunction with the categorical cross-entropy loss, which does not require negative triples:

\min_{\Theta} \sum_{(h,r,t) \in T_{\mathrm{pos}}} -\log\left(\frac{\exp(f_r(e_h, e_t))}{\sum_{r' \in R} \exp(f_{r'}(e_h, e_t))}\right) + \lambda \|\Theta\|_2^2,   (2)

where Θ are the parameters trained during learning.

For visual-relational KGs, the input consists of raw image data instead of the one-hot encodings of entities. The approach we propose builds on the ideas and methods developed for KG completion. Instead of having a simple embedding function g that multiplies the input with a weight matrix, however, we use deep convolutional neural networks to extract meaningful visual features from the input images. For the composition function f we evaluate the four operations that were used in the KG completion literature: difference, multiplication, concatenation, and circular correlation. Figure 6a depicts the basic architecture we trained for query answering. The weights of the parts of the neural network responsible for embedding the raw image input, denoted by g, are tied. We also experimented with additional hidden layers indicated by the dashed dense layer. The composition operation op is either difference, multiplication, concatenation, or circular correlation. To the best of our knowledge, this is the first time that KG embedding learning and deep CNNs have been combined for visual-relational query answering.

Figure 6: (a) Proposed architecture for query answering (two weight-tied VGG16 towers, 256-dimensional embeddings, a composition operation op, and a prediction function f). (b) Illustration of two possible approaches to visual-relational query answering: one can predict relation types between two images directly (green arrow; our approach) or combine an entity classifier with a KB embedding model for relation prediction (red arrows; baseline VGG16+DistMult).

5 EXPERIMENTS

We conduct a series of experiments to evaluate our proposed approach to visual-relational query answering. First, we describe the experimental set-up that applies to all experiments. Second, we report and interpret results for the different types of visual-relational queries.

5.1 General Set-up

We used Caffe, a deep learning framework [20], for designing, training, and evaluating the proposed models. The embedding function g is based on the VGG16 model introduced in [42]. We pre-trained the VGG16 on the ILSVRC2012 data set derived from ImageNet [10] and removed the softmax layer of the original VGG16. We added a 256-dimensional layer after the last dense layer of the VGG16. The output of this layer serves as the embedding of the input images. The reason for reducing the embedding dimensionality from 4096 to 256 is the objective to obtain an efficient and compact latent representation that is feasible for KGs with billions of entities. For the composition function f, we used one of the four operations difference, multiplication, concatenation, and circular correlation. We also experimented with an additional hidden layer with ReLU activation. Figure 6a depicts the generic network architecture. The output layer of the architecture has a softmax or sigmoid activation with cross-entropy loss. We initialized the weights of the newly added layers with the Xavier method [16].
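The following PyTorch sketch shows one way this architecture could be wired up. It is an illustrative reimplementation under the assumptions stated above (a weight-tied VGG16 backbone, a 256-dimensional embedding layer, concatenation as the composition operation, and a softmax over the 1,330 relation types), not the authors' original Caffe model; the class and variable names are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

class VisualRelationNet(nn.Module):
    def __init__(self, num_relations: int = 1330, emb_dim: int = 256):
        super().__init__()
        vgg = models.vgg16(weights=None)  # in the paper: ImageNet-pretrained
        # Drop the final 1000-way classifier layer; the output is now 4096-d.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.backbone = vgg                           # shared by head and tail
        self.embed = nn.Linear(4096, emb_dim)         # 256-d image embedding g
        self.classify = nn.Linear(2 * emb_dim, num_relations)  # CAT composition

    def forward(self, img_head, img_tail):
        e_h = self.embed(self.backbone(img_head))
        e_t = self.embed(self.backbone(img_tail))
        return self.classify(torch.cat([e_h, e_t], dim=1))  # relation logits

model = VisualRelationNet()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1330])
```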

We used a batch size of 45, which was the maximum that fit into GPU memory. To create the training batches, we sample a triple uniformly at random from the training triples. For the given triple, we randomly sample one image for the head and one for the tail from the set of training images. We applied SGD with a learning rate of 10⁻⁵ for the parameters of the VGG16 and a learning rate of 10⁻³ for the remaining parameters. It is crucial to use two different learning rates since the large gradients in the newly added layers would lead to unreasonable changes in the pretrained part of the network. We set the weight decay to 5×10⁻⁴. We reduced the learning rate by a factor of 0.1 every 40,000 iterations. Each of the models was trained for 100,000 iterations.
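In PyTorch terms, the two learning rates and the step schedule described above could be set up with parameter groups as sketched below (again an illustrative translation of the Caffe configuration, not the authors' setup; `model` refers to the network sketched earlier):

```python
import torch

new_params = list(model.embed.parameters()) + list(model.classify.parameters())
optimizer = torch.optim.SGD(
    [
        {"params": model.backbone.parameters(), "lr": 1e-5},  # pretrained VGG16
        {"params": new_params, "lr": 1e-3},                   # newly added layers
    ],
    weight_decay=5e-4,
)
# Multiply both learning rates by 0.1 every 40,000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40_000, gamma=0.1)

# One (schematic) training iteration on a sampled batch:
#   loss = torch.nn.functional.cross_entropy(model(img_h, img_t), relation_ids)
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```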

Table 3: Results for the relation prediction problem.

Model            Median  Hits@1  Hits@10  MRR
VGG16+DistMult   94      6.0     11.4     0.087
Prob. Baseline   35      3.7     26.5     0.104
DIFF             11      21.1    50.0     0.307
MULT             8       15.5    54.3     0.282
CAT              6       26.7    61.0     0.378
DIFF+1HL         8       22.6    55.7     0.333
MULT+1HL         9       14.8    53.4     0.273
CAT+1HL          6       25.3    60.0     0.365

Since the answers to all query types are either rankings of images or rankings of relations, we utilize metrics measuring the quality of rankings. In particular, we report results for hits@1 (hits@10, hits@100), measuring the percentage of times the correct relation was ranked highest (ranked in the top 10, top 100). We also compute the median of the ranks of the correct entities or relations and the Mean Reciprocal Rank (MRR) for entity and relation rankings, respectively, defined as follows:

\mathrm{MRR} = \frac{1}{2|T|} \sum_{(h,r,t) \in T} \left( \frac{1}{\mathrm{rank}_{\mathrm{img}}(h)} + \frac{1}{\mathrm{rank}_{\mathrm{img}}(t)} \right)   (3)

\mathrm{MRR} = \frac{1}{|T|} \sum_{(h,r,t) \in T} \frac{1}{\mathrm{rank}_r},   (4)

where T is the set of all test triples, rank_r is the rank of the correct relation, and rank_img(h) is the rank of the highest ranked image of entity h. For each query, we remove all triples that are also correct answers to the query from the ranking. All experiments were run on commodity hardware with 128 GB RAM, a single 2.8 GHz CPU, and an NVIDIA 1080 Ti.
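Given the (filtered) rank of the correct answer for every test query, these metrics reduce to a few lines of Python; the helper below is a sketch for the relation-ranking case of Eq. (4), with 1-based ranks:

```python
import statistics

def ranking_metrics(ranks, ks=(1, 10, 100)):
    """ranks: list of 1-based ranks of the correct relation, one per test query."""
    n = len(ranks)
    metrics = {"median": statistics.median(ranks),
               "mrr": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        metrics[f"hits@{k}"] = 100.0 * sum(r <= k for r in ranks) / n
    return metrics

# Example: three queries whose correct relation was ranked 1st, 4th, and 12th.
print(ranking_metrics([1, 4, 12]))
```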

5.2 Visual Relation Prediction

Given a pair of unseen images, we want to determine the relations between their underlying unknown entities. This can be expressed with (img_h, r?, img_t). Figure 3(1) illustrates this query type, which we refer to as visual relation prediction. We train the deep architectures using the training and validation triples and images, respectively. For each triple (h, r, t) in the training data set, we sample one training image uniformly at random for both the head and the tail entity. We use the architecture depicted in Figure 6a with the softmax activation and the categorical cross-entropy loss. For each test triple, we sample one image uniformly at random from the test images of the head and tail entity, respectively. We then use the pair of images to query the trained deep neural networks. To get a more robust statistical estimate of the evaluation measures, we repeat the above process three times per test triple. Again, none of the test triples and images are seen during training, nor are any of the training images used during testing. Computing the answer to one query takes the model 20 ms.

We compare the proposed architectures to two different baselines: one based on entity classification followed by a KB embedding method for relation prediction (VGG16+DistMult), and a probabilistic baseline (Prob. Baseline). The entity classification baseline consists of fine-tuning a pretrained VGG16 to classify images into the 14,870 entities of ImageGraph. To obtain the relation type ranking at test time, we predict the entities for the head and the tail using the VGG16 and then use the KB embedding method DistMult [53] to return a ranking of relation types for the given (head, tail) pair. DistMult is a KB embedding method that achieves state of the art results for KB completion on FB15k [22]. Therefore, for this experiment we just substitute the original output layer of the VGG16 pretrained on ImageNet with a new output layer suitable for our problem. To train, we join the train and validation splits, set the learning rate to 10⁻⁵ for all layers, and train following the same strategy that we use in all of our experiments. Once the system is trained, we test the model by classifying the entities of the images in the test set. To train DistMult, we sample 500 negative triples for each positive triple and use an embedding size of 100. Figure 6b illustrates the VGG16+DistMult baseline and contrasts it with our proposed approach. The second baseline (probabilistic baseline) computes the probability of each relation type using the set of training and validation triples. The baseline ranks relation types based on these prior probabilities.
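The probabilistic baseline amounts to ranking relation types by their prior frequency, independent of the query images; a minimal sketch:

```python
from collections import Counter

def relation_prior_ranking(train_triples, valid_triples):
    """Return relation types ordered by frequency in the train+valid triples."""
    counts = Counter(r for (_, r, _) in list(train_triples) + list(valid_triples))
    return [r for r, _ in counts.most_common()]  # same ranking for every query

# Toy example with triples of the form (head, relation, tail).
triples = [("a", "locatedIn", "b"), ("c", "locatedIn", "d"), ("a", "capitalOf", "b")]
print(relation_prior_ranking(triples, []))  # ['locatedIn', 'capitalOf']
```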

Table 3 lists the results for the two baselines and the different proposed architectures. The probabilistic baseline outperforms the VGG16+DistMult baseline in 3 of the metrics. This is due to the highly skewed distribution of relation types in the training, validation, and test triples: a small number of relation types makes up a large fraction of the triples. Figures 4a and 5b plot the counts of relation types and entities, respectively. Moreover, despite DistMult achieving a hits@1 value of 0.46 for the relation prediction problem between entity pairs, the baseline VGG16+DistMult performs poorly. This is due to the poor entity classification performance of the VGG (accuracy: 0.082, F1: 0.068). In the remainder of the experiments, therefore, we only compare to the probabilistic baseline. The lower part of Table 3 lists the results for the proposed models. DIFF, MULT, and CAT stand for the different possible composition operations. We omitted the composition operation circular correlation since we were not able to make the corresponding model converge, despite trying several different optimizers and hyperparameter settings. The postfix 1HL stands for architectures where we added an additional hidden layer with ReLU activation before the softmax. The concatenation operation clearly outperforms the multiplication and difference operations. This is contrary to findings in the KG completion literature where MULT and DIFF outperformed the concatenation operation. The models with the additional hidden layer did not perform better than their shallower counterparts, with the exception of the DIFF model. We hypothesize that this is due to difference being the only linear composition operation, benefiting from an additional non-linearity. Each of the proposed models outperforms the baselines.

Table 4: Results for the multi-relational image retrieval problem.

                 Median          Hits@100        MRR
Model            Head    Tail    Head    Tail    Head    Tail
Baseline         6504    2789    11.9    18.4    0.065   0.115
DIFF             1301    877     19.6    26.3    0.051   0.094
MULT             1676    1136    16.8    22.9    0.040   0.080
CAT              1022    727     21.4    27.5    0.050   0.087
DIFF+1HL         1644    1141    15.9    21.9    0.045   0.085
MULT+1HL         2004    1397    14.6    20.5    0.034   0.069
CAT+1HL          1323    919     17.8    23.6    0.042   0.080
CAT-SIG          814     540     23.2    30.1    0.049   0.082

5.3 Multi-Relational Image Retrieval

Given an unseen image, for which we do not know the underlying KG entity, and a relation type, we want to retrieve existing images that complete the query. If the image for the head entity is given, we return a ranking of images for the tail entity; if the tail entity image is given, we return a ranking of images for the head entity. This problem corresponds to query type (2) in Figure 3. Note that this is equivalent to performing multi-relational metric learning which, to the best of our knowledge, has not been done before. We performed experiments with each of the three composition functions f and for two different activation/loss functions. First, we used the models trained with the softmax activation and the categorical cross-entropy loss to rank images. Second, we took the models trained with the softmax activation and substituted the softmax activation with a sigmoid activation and the corresponding binary cross-entropy loss. For each training triple (h, r, t) we then created two negative triples by sampling once the head and once the tail entity from the set of entities. The negative triples are then used in conjunction with the binary cross-entropy loss of equation (1) to refine the pretrained weights. Directly training a model with the binary cross-entropy loss was not possible since the model did not converge properly. Pretraining with the softmax activation and categorical cross-entropy loss was crucial to make the binary loss work.

During testing, we used the test triples and ranked the images based on the probabilities returned by the respective models. For instance, given the query (img_Sensō-ji, locatedIn, img_t?), we substituted img_t? with all training and validation images, one at a time, and ranked the images according to the probabilities returned by the models. We use the rank of the highest ranked image belonging to the true entity (here: Japan) to compute the values for the evaluation measures. We repeat the same experiment three times (each time randomly sampling the images) and report average values. Again, we compare the results for the different architectures with a probabilistic baseline. For this baseline, however, we compute a distribution of head and tail entities for each of the relation types. For example, for the relation type locatedIn we compute two distributions, one for head and one for tail entities. We used the same measures as in the previous experiment to evaluate the returned image rankings.

Figure 7: Example queries and results for the multi-relational image retrieval problem (query relations include succeededBy, genreOf, wonAward, and directedBy).

Table 5: Results for the zero-shot learning experiments (H = head, T = tail).

                      Median      Hits@1        Hits@10       MRR
                      H     T     H      T      H      T      H      T
Zero-Shot Query (3)
  Base                34    31    1.9    2.3    18.2   28.7   0.074  0.089
  CAT                 8     7     19.1   22.4   54.2   57.9   0.306  0.342
Zero-Shot Query (4)
  Base                9     5     13.0   22.6   52.3   64.8   0.251  0.359
  CAT                 5     3     26.9   33.7   62.5   70.4   0.388  0.461

Table 4 lists the results of the experiments. As for relation prediction, the best performing models are based on the concatenation operation, followed by the difference and multiplication operations. The architectures with an additional hidden layer do not improve the performance. We also provide the results for the concatenation-based model with softmax activation where we refined the weights using a sigmoid activation and negative sampling as described before. This model is the best performing model. All neural network models are significantly better than the baseline with respect to the median and hits@100. However, the baseline has slightly superior results for the MRR. This is due to the skewed distribution of entities and relations in the KG (see Figure 5b and Figure 4a). This shows once more that the baseline is highly competitive for the given KG. Figure 7 visualizes the answers the CAT-SIG model provided for a set of four example queries. For the two queries on the left, the model performed well and ranked the correct entity in the top 3 (green frame). The examples on the right illustrate queries for which the model returned an inaccurate ranking. To perform query answering in a highly efficient manner, we precomputed and stored all image embeddings once, and only compute the scoring function (involving the composition operation and a dot product with ϕ_r) at query time. Answering one multi-relational image retrieval query (which would otherwise require 613,138 individual queries, one per possible image) took only 90 ms.
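The speed-up comes from precomputing the embeddings of all candidate images once, so that a query only requires the composition and a dot product per candidate. The NumPy sketch below illustrates this for the simplified concatenation scoring function f_r(e_h, e_t) = (e_h ⊙ e_t) · ϕ_r from Section 4 (not the full trained network), with a reduced candidate set for the example:

```python
import numpy as np

def retrieve_tails(e_head, all_embeddings, phi_r, top_k: int = 10):
    """e_head: (d,), all_embeddings: (N, d), phi_r: (2d,) -> indices of top-k tails."""
    d = e_head.shape[0]
    head_term = float(e_head @ phi_r[:d])            # constant for this query
    scores = head_term + all_embeddings @ phi_r[d:]  # one score per candidate
    return np.argsort(-scores)[:top_k]

# Example with 10,000 candidates (the paper ranks 613,138 images) and d = 256.
rng = np.random.default_rng(0)
E = rng.normal(size=(10_000, 256))
print(retrieve_tails(rng.normal(size=256), E, rng.normal(size=512)))
```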

Figure 8: Example results for zero-shot learning. For each pair of images the top three relation types (as ranked by the CAT model) are listed. For the pair of images at the top (Back to the Future, Special Effects Supervisor), the first relation type is correct. For the pair of images at the bottom (Library of Congress, Card Game), the correct relation type TaxonomyHasEntry is not among the top three relation types.

5.4 Zero-Shot Visual Relation Prediction

The last set of experiments addresses the problem of zero-shot learning via visual relation prediction. For both query types, we are given a new image of an entirely new entity that is not part of the KG. The first query type asks for relations between the given image and an unseen image for which we do not know the underlying KG entity. The second query type asks for the relations between the given image and an existing KG entity. We believe that creating multi-relational links to existing KG entities is a reasonable approach to zero-shot learning since an unseen entity or category is integrated into an existing KG. The relations to existing visual concepts and their attributes provide a characterization of the new entity/category. This problem cannot be addressed with KG embedding methods since entities need to be part of the KG during training for these models to work.

For the zero-shot experiments, we generated a new set of training, validation, and test triples. We randomly sampled 500 entities that occur as head (tail) in the set of test triples. We then removed all training and validation triples whose head or tail is one of these 1000 entities. Finally, we only kept those test triples with one of the 1000 entities either as head or tail, but not both. For query type (4), where we know the target entity, we sample 10 of its images and use the models 10 times to compute a probability. We use the average probabilities to rank the relations. For query type (3) we only use one image sampled randomly. As with the previous experiments, we repeated the procedure three times and averaged the results. For the baseline, we compute the probabilities of relations in the training and validation set (for query type (3)) and the probabilities of relations conditioned on the target entity (for query type (4)). Again, these are very competitive baselines due to the skewed distribution of relations and entities.

Table 5 lists the results of the experiments. The model based on the concatenation operation (CAT) outperforms the baseline and performs surprisingly well. The deep models are able to generalize to unseen images since their performance is comparable to the performance in the relation prediction task (query type (1)) where the entity was part of the KG during training (see Table 3). Figure 8 depicts example queries for the zero-shot query type (3). For the first query example, the CAT model ranked the correct relation type first (indicated by the green bounding box). The second example is more challenging and the correct relation type was not part of the top 10 ranked relation types.
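The zero-shot split construction described above can be sketched in a few lines; the entity counts and the random seed are illustrative:

```python
import random

def zero_shot_split(train, valid, test, n_heads=500, n_tails=500, seed=0):
    """Drop sampled entities from train/valid; keep test triples touching exactly one."""
    rng = random.Random(seed)
    heads = sorted({h for (h, _, _) in test})
    tails = sorted({t for (_, _, t) in test})
    zs = set(rng.sample(heads, n_heads)) | set(rng.sample(tails, n_tails))
    keep = lambda tr: tr[0] not in zs and tr[2] not in zs
    train_f = [tr for tr in train if keep(tr)]
    valid_f = [tr for tr in valid if keep(tr)]
    test_f = [tr for tr in test if (tr[0] in zs) != (tr[2] in zs)]  # exactly one
    return train_f, valid_f, test_f, zs
```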

6 CONCLUSION

KGs are at the core of numerous AI applications. Research has focused either on KG completion methods working only on the relational structure or on scene understanding in a single image. We present a novel visual-relational KG where the entities are enriched with visual data. We propose several novel query types and introduce neural architectures suitable for probabilistic query answering. We also propose a novel approach to zero-shot learning as the problem of visually mapping an image of an entirely new entity to a KG.

We have observed that, for some relation types, the proposed models tend to learn a fine-grained visual type that typically occurs as the head or tail of the relation type. In these cases, conditioning on either the head or tail entity does not influence the predictions of the models substantially. This is a potential shortcoming of the proposed methods and we believe that there is a lot of room for improvement for probabilistic query answering in visual-relational KGs.

REFERENCES

[1] Sungjin Ahn, Heeyoul Choi, Tanel Parnamaa, and Yoshua Bengio. 2016. A Neural Knowledge Language Model. arXiv preprint arXiv:1608.00318 (2016).
[2] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. 2000. Gene Ontology: tool for the unification of biology. Nat Genet 25, 1 (2000), 25–29.
[3] J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. 2015. Predicting deep zero-shot convolutional neural networks using textual descriptions. In CVPR.
[4] Sean Bell and Kavita Bala. 2015. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34, 4 (2015), 98.
[5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In SIGMOD. 1247–1250.
[6] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075 (2015).
[7] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
[8] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. 2013. NEIL: Extracting visual knowledge from web data. In Proceedings of the IEEE International Conference on Computer Vision. 1409–1416.
[9] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In CVPR.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
[11] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. 2013. Mid-level Visual Element Discovery as Discriminative Mode Seeking. In Advances in Neural Information Processing Systems 26. 494–502.
[12] Ali Farhadi, Ian Endres, Derek Hoiem, and David A. Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 1778–1785.
[13] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 9 (2010), 1627–1645.
[14] Matt Gardner and Tom M. Mitchell. 2015. Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction. In EMNLP. 1488–1498.
[15] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[16] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
[17] Abhinav Gupta, Aniruddha Kembhavi, and Larry S. Davis. 2009. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009), 1775–1789.
[18] Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094 (2015).
[19] Hamid Izadinia, Fereshteh Sadeghi, and Ali Farhadi. 2014. Incorporating scene context and object layout into appearance modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 232–239.
[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
[21] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. 2015. Image Retrieval using Scene Graphs. In CVPR.
[22] Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. 2017. Knowledge Base Completion: Baselines Strike Back. arXiv preprint arXiv:1705.10744 (2017).
[23] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332 (2016).
[24] Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 529–539.
[25] Yining Li, Chen Huang, Xiaoou Tang, and Chen Change Loy. 2017. Learning to Disambiguate by Asking Discriminative Questions. In ICCV.
[26] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations. In CVPR.
[27] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. 2016. Visual relationship detection with language priors. In ECCV.
[28] Tomasz Malisiewicz and Alexei A. Efros. 2009. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In Advances in Neural Information Processing Systems.
[29] Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. 2017. The More You Know: Using Knowledge Graphs for Image Classification. In CVPR.
[30] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proc. IEEE 104, 1 (2016), 11–33.
[31] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016. Holographic Embeddings of Knowledge Graphs. In Proceedings of the Thirtieth Conference on Artificial Intelligence. 1955–1961.
[32] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 809–816.
[33] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4004–4012.
[34] Megha Pandey and Svetlana Lazebnik. 2011. Scene recognition and weakly supervised object localization with deformable part-based models. In Computer Vision (ICCV), 2011 IEEE International Conference on. 1307–1314.
[35] B. Romera-Paredes and P. Torr. 2015. An embarrassingly simple approach to zero-shot learning. In ICML.
[36] M. Rotmensch, Y. Halpern, A. Tlimat, S. Horng, and D. Sontag. 2017. Learning a Health Knowledge Graph from Electronic Medical Records. Nature Scientific Reports 5994, 7 (2017).


[37] Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C. Berg, and Li Fei-Fei. 2013. Detecting avocados to zucchinis: what have we done, and where are we going?. In International Conference on Computer Vision (ICCV).
[38] Fereshteh Sadeghi and Marshall F. Tappen. 2012. Latent Pyramidal Regions for Recognizing Scenes. In Proceedings of the 12th European Conference on Computer Vision - Volume Part V. 228–241.
[39] F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 815–823.
[40] Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus. arXiv preprint arXiv:1603.06807 (2016).
[41] Edgar Simo-Serra and Hiroshi Ishikawa. 2016. Fashion Style in 128 Floats: Joint Ranking and Classification Using Weak Data for Feature Extraction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR (2014).
[43] Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems. 1857–1865.
[44] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. 2017. Deep Metric Learning via Facility Location. In Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Damien Teney, Lingqiao Liu, and Anton van den Hengel. 2017. Graph-Structured Representations for Visual Question Answering. In CVPR.
[46] Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality. 57–66.
[47] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. arXiv preprint arXiv:1606.06357 (2016).
[48] Kristen Vaccaro, Sunaya Shivakumar, Ziqiao Ding, Karrie Karahalios, and Ranjitha Kumar. 2016. The Elements of Fashion Style. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM.
[49] Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, and Serge Belongie. 2015. Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences. In ICCV.
[50] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. 2017. Deep Metric Learning with Angular Loss. In International Conference on Computer Vision (ICCV).
[51] David S. Wishart, Craig Knox, Anchi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. 2008. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Research 36 (2008), 901–906.
[52] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition. 3485–3492.
[53] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Learning multi-relational semantics using neural-embedding models. arXiv preprint arXiv:1411.4072 (2014).
[54] Bangpeng Yao and Li Fei-Fei. 2010. Modeling mutual context of object and human pose in human-object interaction activities. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. 17–24.
[55] Z. Zhang and V. Saligrama. 2015. Zero-shot learning via semantic similarity embedding. In ICCV.
[56] Yuke Zhu, Alireza Fathi, and Li Fei-Fei. 2014. Reasoning about object affordances in a knowledge base representation. In European Conference on Computer Vision. 408–424.
