![Page 1: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/1.jpg)
1
Multimodal Knowledge GraphsGeneration Methods, Applications, and Challenges
Shih‐Fu Chang
Alireza Zareian, Hassan Akbari, Brian Chen, Svebor Karaman, Zhecan James Wang, and Haoxuan You
Columbia University
Prof. Heng Ji, Manling Li, Di Lu, and Spencer WhiteheadUniversity of Illinois, Urbana‐Champaign
![Page 2: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/2.jpg)
Knowledge Graphs Entities, events, relations, etc.
2
Text IE
VisitIsrael
Prince William
The first-ever official visit by a British royal to Israel is underway. Prince William the 36-year-old Duke of Cambridge and second in line to the throne will meet with both Israeli and Palestinian leaders over the next three days.
![Page 3: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/3.jpg)
Knowledge Graphs Entities, events, relations, etc. Events describe what happens
Entities are characterized by the argument role they play in events
3
Text IE
VisitIsrael
Prince William
The first-ever official visit by a British royal to Israel is underway Prince William the 36-year-old Duke of Cambridge and second in line to the throne will meet with both Israeli and Palestinian leaders over the next three days.
Agent
Destination
![Page 4: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/4.jpg)
Application: Question Answering, Reasoning, Hypothesis Verification and Discovery
Knowledge Graphs
4
Text IE
VisitIsrael
Prince William
Find recent visits of politicians to Israel.
Answers:
The first-ever official visit by a British royal to Israel is underway Prince William the 36-year-old Duke of Cambridge and second in line to the throne will meet with both Israeli and Palestinian leaders over the next three days.
Agent
Destination
![Page 5: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/5.jpg)
Knowledge Beyond Text• We communicate through multimedia
• Our experiment shows 34% of news images contain event arguments that are not mentioned in text
TransportPerson_Instrument = stretcher
Stretcher
Fire
5
![Page 6: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/6.jpg)
Why Multimodal? Visual data contains complementary data used for:
Visual Illustration Disambiguation Additional Details
6
AttackProtesters
Bus
Agent Target
Instrument
Stone
Transport
Instrument
Transport
Woundedprotester
Agent
Person
Supporters
Person
Destination
Rally
![Page 7: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/7.jpg)
Challenges & Applications Challenges:
Parsing images/videos to structures Grounding event/entities across modalities Extracting complementary multimodal
arguments
7
Text IE
Visual IE
?
Application
Scene graphText graph
Multi-ModalKnowledge Graph
![Page 8: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/8.jpg)
Challenge 1: Parsing Images to Scene Graphs Extract structured representation of a scene
Entities and their semantic relationships
8
Object Detection
![Page 9: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/9.jpg)
Parsing Images to Scene Graphs Existing method
Extract object proposals
Contextualize features by RNN (or message passing)
Classify all nodes and pairs of nodes
Limitations Computationally exhaustive
𝑂 𝑛 for 𝑛 100 proposals
Difficult to model higher order relationships, e.g. “girl eating cake with fork”
Requires full supervision
9
(Xu et. al, CVPR 2017)
Neural Motifs (Zellers, Yatskar, Thomson, Choi, CVPR 2018)One of the SOTA methods for scene graph generation
![Page 10: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/10.jpg)
Reformulate as an Event-Centric Problem Our work: Visual Semantic Parsing Network (Zareian et al. CVPR19)
Generalized formulation of scene graph generation Entity-centric bipartite representation of predicates & entities
Reduce computational complexity from 𝑂 𝑛 to sub-quadratic
Model argument role relations beyond (subject, object), (agent, patient) relations
10
eating
holding
belongs
agent
patient
Girl
Cake
Hand
Fork instrument
![Page 11: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/11.jpg)
Reformulate as an Event-Centric Problem Our work: Visual Semantic Parsing Network (Zareian et al. CVPR20)
Generalized formulation of scene graph generation Entity-centric bipartite representation of predicates & entities
Reduce computational complexity from 𝑂 𝑛 to sub-quadratic
Model argument role relations beyond (subject, object), (agent, patient) relations
11
eating
holding
belong
agent
patient
Girl
Cake
Hand
Fork
instrument
![Page 12: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/12.jpg)
Bipartite Embeddings for Entity & Predicate
12
𝐻 1
𝐻 2
𝐻 𝑛
𝐻 1
𝐻 2
𝐻 3
𝐻 𝑛
…
…
RPN
RoIAlign
TrainablePredicate
Embedding Bank
![Page 13: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/13.jpg)
Initialize entity and predicate nodes Compute role-specific attention scores
Input: entity-predicate feature pairs
Output: scalar for each thematic role
Argument Role Prediction
13
𝐻 1
𝐻 2
𝐻 𝑛
𝐻 1
𝐻 2
𝐻 3
𝐻 𝑛
…
…
FC
FC
…
agentpatient
instrument
![Page 14: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/14.jpg)
Role-Dependent Message Passing Bi-directional Message passing Entities Roles Predicates
14
𝐻 1
𝐻 2
𝐻 𝑛
𝐻 1
𝐻 2
𝐻 3
𝐻 𝑛
…
…
…
agen
tpa
tient
inst
rum
ent
………
…
FC _→ .
FC _→ .
FC _→ .
FC _→ .
FC _→
FC _→
FC _→ .
…
FC _→
Message Passing
![Page 15: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/15.jpg)
Role-Dependent Message Passing Bi-directional Message passing Entities Roles Predicates
15
𝐻 1
𝐻 2
𝐻 𝑛
𝐻 1
𝐻 2
𝐻 3
𝐻 𝑛
…
…
…
agen
tpa
tient
inst
rum
ent
………
…
FC _→
FC _→
FC _→
FC _→
FC _→
FC _→ .
…
FC _→
Message Passing
![Page 16: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/16.jpg)
Visual Semantic Parsing Network Bi-directional Message passing Repeat for 𝑢 iterations Classify nodes and edges
16
𝐻 1
𝐻 2
𝐻 𝑛
𝐻 1
𝐻 2
𝐻 3
𝐻 𝑛
…
…………… …
…
agentpatient
instrument
……… …
eating
holding
belong
Girl
Cake
Hand
Fork
…
…
FC
FC
Binarize
![Page 17: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/17.jpg)
Visual Semantic Parsing Network Weakly supervised training
Unknown alignment between output and ground truth graphs
17
𝐻 1
𝐻 2
𝐻 𝑛
𝐻 1
𝐻 2
𝐻 3
𝐻 𝑛
…
…………… … …
agentpatient
instrument
…… … …
eating
holding
belong
Girl
Cake
Hand
Fork
…
…
Ground truth𝓛𝑬 𝓛𝑷𝓛𝑹
Girl | 𝐶 1
Cake| 𝐶 2
Hand| 𝐶 3
Fork| 𝐶 𝑛
eating| 𝐶 1
belong| 𝐶 𝑛
holding| 𝐶 2
![Page 18: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/18.jpg)
Visual Semantic Parsing Network
18
![Page 19: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/19.jpg)
Incorporate External KB (Zareian, et al, ECCV20)
Link concepts in scene graphs to external knowledge bases such as ConceptNet
Pass messages over bridges between scene graphs and external graphs
Refine bridges between graphs
19
![Page 20: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/20.jpg)
Scene Graph Examples of GB-NET
20
Ours (GB-Net) Baseline (KERN) Ours (GB-Net) Baseline (KERN)
![Page 21: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/21.jpg)
Challenge 2: Text-Visual Grounding (Akbari et al CVPR19)
21
Localize text query in image Bridge visual and text knowledge graphs Without using predefined classifiers
Challenges Sensitive to domain variations Abstract concept not groundable
![Page 22: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/22.jpg)
Challenge 3: Multimodal Event & Argument Extraction
Challenges: Parsing images/videos to structures Grounding entities across modalities Joint extraction of multimodal
argument
22
Text IE
Visual IE
?
Application
Scene graphText graph
Multi-ModalKnowledge Graph
![Page 23: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/23.jpg)
Multimodal KG Example
23
AttackProtesters
Bus
Agent Target
Instrument
Stone
Transport
Instrument
Transport
Woundedprotester
Agent
Person
Supporters
Person
Destination
Rally
![Page 24: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/24.jpg)
Event Movement.TransportPerson deploy
Arguments
Transporter United StatesDestination outskirtsPassenger soldiers
Vehicle land vehicleVehicle land vehicle
Last week , U.S . Secretary of State Rex Tillersonvisited Ankara, the first senior administration official to visit Turkey, to try to seal a deal about the battle for Raqqa and to overcome President Recep Tayyip Erdogan's strong objections to Washington's backing of the Kurdish Democratic Union Party (PYD) militias. Turkish forces have attacked SDF forces in the past around Manbij, west of Raqqa, forcing the United States to deploy dozens of soldiers on the outskirtsof the town in a mission to prevent a repeat of clashes, which risk derailing an assault on Raqqa.
Input: News article text and image
Output: Image‐related Events & Visual Argument Roles
land vehicleland vehicle
A New Task: Multimedia Event Extraction (M2E2)
24
![Page 25: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/25.jpg)
A New Task: Multimedia Event Extraction (M2E2)
Event Conflict.Attack airstrikes
ArgumentsAttacker U.S.-led coalition forces
Target airplane
Target vehicle
Output: Image‐related Events & Visual Argument Roles
Input: News article text and imageIn March , Turkish forces escalated attacks on the YPG innorthern Syria , forcing U.S. to deploy a small number offorces in and around the town of Manbij to the northwestof Raqqa to “deter” Turkish - SDF clashes and ensure thefocus remains on Islamic State. Meanwhile, Raqqa isbeing pummeled by airstrikes mounted by U.S.-ledcoalition forces and Syrian warplanes. Local anti-ISactivists say the air raids fail to distinguish betweenmilitary and non-military targets …
airplane vehicle
25
![Page 26: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/26.jpg)
• Treat image as another language• Represent it with a structure that is similar to AMR in text• Can we find a common representation?
placemeans
Cross‐media Structured Common Space
26
Linguistic Structure (Abstract Meaning Representation (AMR) /
Dependency Tree)
Visual Semantic Graph[Zareian et al. CVPR20]
![Page 27: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/27.jpg)
Image to Event Graph• ImSitu dataset: situation recognition (Yatskar et al., 2016)
• Classify an image as one of 500+ FrameNet verbs (sharing part of ACE)
• Identify 192 generic semantic roles
27
![Page 28: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/28.jpg)
28
Weakly Aligned Structured Embedding (WASE) ‐‐ Cross‐media shared representation and classifiers (Li, Zareian, et al, ACL20)
![Page 29: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/29.jpg)
• Prior work aligns image‐caption vectors by triplet loss.• We want to align two graphs, not just single vectors.
Use image‐caption data for graph alignment
Cross-A
ttention
X
–Loss29
![Page 30: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/30.jpg)
Cross-A
ttention
X
–Loss
30
• Prior work aligns image‐caption vectors by triplet loss.• We want to align two graphs, not just single vectors.
Use image‐caption data for graph alignment
![Page 31: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/31.jpg)
• Ontology: shared between ACE and imSitu• Event Types: cover 52% of ACE event types• Argument Roles: Based on ACE argument roles, add additional
detectable visual roles (marked in red)
Event Type Argument RolesLife.Die Agent, Victim, Instrument, Place, TimeTransaction.TransferMoney Giver, Recipient, Beneficiary, Money, Instrument, Place, Time
Conflict.Attack Attacker, Instrument, Place, Target, Time
Conflict.Demonstrate Demonstrator, Instrument, Police, Place, Time
Contact.Phone-Write Participant, Instrument, Place, Time
Contact.Meet Participant, Place, Time
Justice.ArrestJail Agent, Person, Instrument, Place, Time
Movement.Transport Agent, Artifact/Person, Instrument, Destination, Origin, Time
A New Multimodal Dataset for M2E2 Evaluation
31
(Li, Zareian, et al, ACL20)
![Page 32: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/32.jpg)
32
Experiment Results
Training with MM
Multimodal Task
![Page 33: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/33.jpg)
Compare to Single Modality Extraction
• Image helps textual event extraction, and surrounding sentence helps visual event extraction
33
Missed by text-only model
Misclassified by image-only model as “Demonstration”
![Page 34: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/34.jpg)
Application 1: Visual Commonsense Reasoning (VCR)
Understand semantics in images and language, explore commonsense Provide to-the-point answer
34Zellers et al. CVPR 2019
![Page 35: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/35.jpg)
Combine Visual Scene Graphs with VCRExpand input to include objects and predicate relations in graphAttention transformers limited to sparse connections in scene graphs
35
[CLS] Why … ? [SEP] …
Graph-based Global-Local Attention Transformers (GLAT, ECCV’20)
…… person1
entity
predicate object
subject
coreference
masking Image-text matching object/relation recognition QA
object object predicate
![Page 36: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/36.jpg)
1 2
3 45
Graph-based Global-Local Attention Transformers (Zareian, et al ECCV20)
36
1 2
3 4
layer 2
layer L
…
concat + linear
local heads
global heads
1111
1
5555
5
layer 1
Node Classifier
Edge Classifier
decoder
person
riding behind
mountain
?
5
1 2
3 45
1 2
3 45
… person
riding behind
mountain
bike
1 2
3 45
entity
predicate object
subject
person
riding behind
mountain
horseground truth
Node & Edge Loss
![Page 37: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/37.jpg)
Model Type (Entity #, Predicate #) Q -> A
LXMERTInitial Graph (36,18) 65.09 (baseline)
Relevance Sel. (8, x) 74.04 (+8.95)
GLAT(LXMERT)
Initial Graph (36, 18) 65.24 (baseline)
Relevance Sel. (26, x) 69.57 (+4.33)
Relevance Sel. (18, x) 72.33 (+7.09)
Relevance Sel. (8, x) 74.45 (+9.21)
Scene Graph + Query-Adaptive Concept Selection● For each question, select most relevant nodes on the scene graph
![Page 38: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/38.jpg)
Q: Why is sheep near the construction ?A: Sheep is near its natural habitat as well.
Initial Graph
man, vest, pants, building, rock, sky, window, shirt (sorted by confidence score from SG)
Relevance, Question
building, door, man, men, window, rock, ground, animal(sorted by relevance score against question)
Relevance, Question + AnswerCandidate
man, building, animal, dirt, rock, gate, ground, plant(sorted by relevance score against question +answer candidate)
![Page 39: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/39.jpg)
Application 2: Multimodal KG Extraction from COVID‐19 Medical Papers
39
Figure 1.FDA approved drugs of most interest for repurposing as potential Ebola virus treatments.
KG from caption text
FDA
Drugs Ebola
approverepurpose
PDF images extraction, segmentation, and recognition
Multimedia Knowledge Graph Construction
Treatment
![Page 40: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/40.jpg)
Conclusions Multimodal Knowledge Graphs
Understanding semantic structures in both language and vision Joint representation and models
Applications Reasoning (VCR) Discovery (COVID-19)
Challenges Open-vocabulary and Self-Supervised models Knowledge graphs for video Commonsense Extraction from MM KG
physics, behavior, causal/temporal
40
Text IE
Visual IE
?
Application
Scene graphText graph
Multi-ModalKnowledge Graph
![Page 41: Multimodal Knowledge Graphs - SuitClub · 2020. 8. 31. · Application: Question Answering, Reasoning, Hypothesis Verification and Discovery Knowledge Graphs 4 Text IE Visit Israel](https://reader036.vdocument.in/reader036/viewer/2022071301/609eb3333017d508c974e19f/html5/thumbnails/41.jpg)
References Zareian, Alireza, Svebor Karaman, and Shih-Fu Chang. "Weakly Supervised Visual
Semantic Parsing." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR 2020.
Zareian, Alireza, Svebor Karaman, and Shih-Fu Chang. "Bridging knowledge graphs to generate scene graphs." arXiv preprint arXiv:2001.02314 (2020). ECCV 2020.
Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level multimodal common semantic space for image-phrase grounding." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
Li, Manling, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. "Cross-media Structured Common Space for Multimedia Event Extraction." arXiv preprint arXiv:2005.02472 (2020). ACL 2020.
Zareian, Alireza, Haoxuan You, Zhecan Wang, and Shih-Fu Chang. "Learning Visual Commonsense for Robust Scene Graph Generation." arXiv preprint arXiv:2006.09623 (2020). ECCV 2020.
41