![Page 1: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/1.jpg)
Visual Dialog
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra
Presented by Wei-Chieh Wu
![Page 2: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/2.jpg)
Visual Dialog
• Requires an AI agent to hold a meaningful dialog with humans about visual content.
• Input:
  • Image
  • Dialog history
  • Question
• Output:
  • Answer to the question
![Page 3: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/3.jpg)
VQA vs Visual Dialog
![Page 4: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/4.jpg)
VisDial Dataset
• Contains ~123k images, with 10 question-answer pairs per image
• Images are from the COCO dataset
• Question-answer pairs are collected on Amazon Mechanical Turk (AMT) from dialogs between human workers
![Page 5: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/5.jpg)
VisDial Dataset
![Page 6: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/6.jpg)
VisDial Dataset
![Page 7: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/7.jpg)
Evaluation
• Given N = 100 candidate answers, return a ranking of them
• Candidate answers:
  • The human response
  • Answers to the 50 most similar questions
  • The 30 most popular answers from the dataset
  • 19 random answers
• Retrieval metrics: MRR, recall@k, mean rank of the human response
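The three retrieval metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the tie-breaking rule, and the default cutoffs `ks=(1, 5, 10)` are assumptions made here for the example.

```python
def rank_of_ground_truth(scores, gt_index):
    """1-based rank of the human response among the candidates.

    `scores` holds one model score per candidate (higher is better).
    Assumption: ties are resolved in favour of the human response.
    """
    gt_score = scores[gt_index]
    return 1 + sum(1 for s in scores if s > gt_score)


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Aggregate per-question ranks into MRR, recall@k and mean rank."""
    n = len(ranks)
    metrics = {
        "mrr": sum(1.0 / r for r in ranks) / n,   # mean reciprocal rank
        "mean_rank": sum(ranks) / n,              # lower is better
    }
    for k in ks:
        # Fraction of questions whose human response is in the top k.
        metrics[f"recall@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Toy example: ranks of the human response for four questions.
ranks = [1, 2, 4, 10]
metrics = retrieval_metrics(ranks)
print(metrics)
```

Higher MRR and recall@k are better, while a lower mean rank is better.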
![Page 8: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/8.jpg)
Models
• Following the encoder-decoder framework
• 2 kinds of decoder
  • Generative Decoder
  • Discriminative Decoder
• 3 kinds of encoder
  • Late Fusion Encoder
  • Hierarchical Recurrent Encoder
  • Memory Network Encoder
![Page 9: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/9.jpg)
Decoders
• Generative Decoder
  • LSTM decoder
  • Maximize the log-likelihood of the ground truth answer
  • Use the model’s log-likelihood scores for ranking
• Discriminative Decoder
  • Compute similarity between the input encoding and LSTM encoding for candidate answers
  • Maximize softmax score of the ground truth answer
  • Use the similarities for ranking
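How each decoder turns its outputs into a ranking can be sketched as follows. This is a toy illustration under stated assumptions: the encodings are hand-written vectors rather than LSTM outputs, the similarity is taken to be a dot product, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def discriminative_scores(input_encoding, candidate_encodings):
    """Similarity (assumed: dot product) between the input encoding
    and the encoding of each candidate answer."""
    return [dot(input_encoding, c) for c in candidate_encodings]


def discriminative_loss(scores, gt_index):
    """Negative log of the softmax score of the ground truth answer;
    minimizing this maximizes that softmax score."""
    return -math.log(softmax(scores)[gt_index])


def generative_score(token_log_probs):
    """Log-likelihood of a candidate answer under the generative decoder:
    the sum of its per-token log-probabilities."""
    return sum(token_log_probs)


def rank_candidates(scores):
    """Candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy encodings standing in for the encoder output and the candidate answers.
enc = [1.0, 0.0]
cands = [[0.5, 0.5], [2.0, 0.0], [-1.0, 1.0]]
scores = discriminative_scores(enc, cands)
order = rank_candidates(scores)
loss = discriminative_loss(scores, gt_index=1)
print(scores, order, loss)
```

The same `rank_candidates` step applies to the generative decoder once each candidate has been scored with `generative_score`.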
![Page 10: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/10.jpg)
Late Fusion (LF) Encoder
![Page 11: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/11.jpg)
Hierarchical Recurrent Encoder (HRE)
![Page 12: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/12.jpg)
Memory Network (MN) Encoder
![Page 13: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/13.jpg)
Experiments
• Dataset: VisDial v0.9
• Baselines
  • NN-Q: Find the k nearest neighbor questions for a test question, and score answers by their mean similarity with the answers to these k questions
  • NN-QI: Find the K nearest neighbor questions for a test question, then find a subset of size k based on image feature similarity. Score answers by their mean similarity with the answers to these k questions
  • VQA models
    • SAN and HieCoAtt
    • Feed VQA outputs to the discriminative decoder, and train end-to-end on VisDial
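The NN-Q baseline can be sketched as below. This is a rough illustration only: questions and answers are assumed to be already embedded as fixed-length vectors, cosine similarity is an assumed choice of similarity, and the names `nn_q_scores`, `train_qs`, and `train_as` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def nn_q_scores(test_q, train_qs, train_as, candidates, k):
    """Score each candidate answer by its mean similarity to the answers
    of the k training questions nearest to the test question (k >= 1)."""
    nearest = sorted(
        range(len(train_qs)),
        key=lambda i: cosine(test_q, train_qs[i]),
        reverse=True,
    )[:k]
    neighbor_answers = [train_as[i] for i in nearest]
    return [
        sum(cosine(c, a) for a in neighbor_answers) / len(neighbor_answers)
        for c in candidates
    ]


# Toy 2-d embeddings (made up for illustration).
test_q = [1.0, 0.0]
train_qs = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.0]]
train_as = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
candidates = [[1.0, 0.0], [0.0, 1.0]]
nn_scores = nn_q_scores(test_q, train_qs, train_as, candidates, k=2)
print(nn_scores)
```

NN-QI would add one step before scoring: retrieve a larger set of K neighbors by question similarity, then keep the k whose images are most similar to the test image.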
![Page 14: Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta ... · Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, Dhruv Batra Presented](https://reader034.vdocument.in/reader034/viewer/2022050323/5f7ccbd5f4ebc4690b1ce52e/html5/thumbnails/14.jpg)
Results