The Promise and the Perils of Learning Grounding from Visual-Textual Web data

Jack Hessel, Cornell University


What is visual-textual grounding?

A collection of tasks requiring connection between visual and textual content.

- Human-Robot Interaction [Matuszek et al. 2012]: "Here are the yellow ones"
- Web Video Parsing [Kim et al. 2014]
- Alt-text Generation [Wu et al. 2017; Sharma et al. 2019]

Why study visual-textual grounding?

Cross-modal reasoning is easy for humans, hard for computers

[Zhu et al. 2016; Photo by Nathan Rupert]

Why study visual-textual grounding?

Cross-modal reasoning is important beyond AI

Cognitive psychology work since at least the 1970s.

[Miller and Johnson-Laird 1976]

"Symbol Grounding Problem"

[Harnad 1990]

"How are those symbols (e.g., the words in our heads) connected to the things they refer to?"

Why study multimodal web data?

Noisy web data is unreasonably effective

Web data is "the best ally we have"

--- Halevy, Norvig, and Pereira, 2009

Benchmark tasks:

- Unimodal Tasks [Deng et al. 2009; Wang et al. 2019]
- Image+Text Tasks, e.g., Flickr1K [Goyal et al. 2017; Suhr et al. 2018; Hudson and Manning, 2019; Young et al. 2014]
- Video+Text Tasks, e.g., Crosstask [Zhukov et al. 2019; Zhou et al. 2018]

Web data used to train for them:

- 3.5B Tagged Instagram Images, 34B Web Tokens [Mahajan et al. 2018; Raffel et al. 2019]
- 3M Webly Supervised Image-Caption Pairs [Sharma et al. 2018]
- 100M Web Video Clips + ASR (HowTo100M) [Miech et al. 2019]

Why study multimodal web data?

Important for understanding web communication

cnn.com, bbc.co.uk...

imgur.com, youtube...

Semioticians have long argued multimodality is a fundamental part of communication

"The power of visual communication is multiplied when it is co-deployed with language in multimodal texts." [Lemke 2002]

My Research Goals:

build better grounding algorithms

understand web communication

requires: need for cross-modal reasoning, real-world knowledge, etc.

requires: design more effective unsupervised training objectives for web data


Computer Vision

Natural Language Processing

Computational Social Science

[WWW 2017]

[NAACL 2019]

[ICWSM 2016]

[NAACL 2018]

[CoNLL 2019]

[EMNLP 2019]

The Promise and the Perils

We can do cool things with multimodal web data, but web texts are not literal image descriptions (even though most algorithms treat them that way)

Learning from multi-sentence, multi-image web documents

[EMNLP 2019: Hessel, Lee, Mimno]

Learning from unlabelled web videos + ASR

[CoNLL 2019: Hessel, Pang, Zhu, Soricut; In Sub: Hessel, Zhu, Pang, Soricut]

Multi-image, Multi-sentence documents?

Image captioning case: one image, one sentence, explicit link by annotation

Our case: multiple images, multiple sentences, no explicit links

The Task: Unsupervised Link Prediction

What's hard about this link prediction task?

- No explicit labels!
- Sentences may have no image
- Images may have no sentence
- Sentences may have multiple images
- Images may have multiple sentences

Multi-image/multi-sentence pretraining framework:

Web pages, product listings, books (current and historical), web comments on images, news articles...

Why you might care about same document retrieval:

"I think it's a group of people riding on the back of a boat."

"General Washington is emphasized by an unnaturally bright sky, while his face catches the upcoming sun. The colors consist of mostly dark tones, as is to be expected at dawn, but there are red highlights."

Stats for Web Datasets

[Figure panels: # sentences/doc, # images/doc]

Model

Bipartite Assignment (structured prediction)

Similarity Score
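The model slide names two pieces: a sentence-image similarity score and a bipartite assignment over those scores. A minimal sketch of how the two could fit together is below. It makes several assumptions that are not on the slide: dot-product similarity, toy hand-written vectors, one link per item on the smaller side, and a brute-force search that is only practical for tiny documents. The function names are hypothetical.

```python
from itertools import permutations


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(sentence_vecs, image_vecs):
    """sim[s][i] = similarity between sentence s and image i (dot product assumed)."""
    return [[dot(s, i) for i in image_vecs] for s in sentence_vecs]


def best_assignment(sim):
    """Brute-force max-weight bipartite assignment.

    Every item on the smaller side gets exactly one partner, and no
    sentence or image is used twice. Returns (total score, sorted links).
    """
    n_s, n_i = len(sim), len(sim[0])
    best_score, best_links = float("-inf"), []
    if n_s >= n_i:
        # Choose a distinct sentence for each image.
        candidates = ([(s, i) for i, s in enumerate(sents)]
                      for sents in permutations(range(n_s), n_i))
    else:
        # Choose a distinct image for each sentence.
        candidates = ([(s, i) for s, i in enumerate(imgs)]
                      for imgs in permutations(range(n_i), n_s))
    for links in candidates:
        score = sum(sim[s][i] for s, i in links)
        if score > best_score:
            best_score, best_links = score, links
    return best_score, sorted(best_links)


# Toy document: three sentences, two images (made-up 2-d features).
sentences = [[1, 0], [0, 1], [0.5, 0.5]]
images = [[0, 2], [3, 0]]
sim = similarity_matrix(sentences, images)
score, links = best_assignment(sim)
print(score, links)
```

In this toy case the third sentence is left unlinked, which matches the "sentences may have no image" case from the task description. A polynomial-time solver would replace the brute-force loop for documents of realistic size.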

Training: Max Margin loss with Negative Sampling

I took the kids down to the river on this fine spring day.

The river has always fascinated me. It's not a huge river, but it has...

[male] had his adorable hat on, and I loved watching him watch the water

He found a rock he liked, and asked to take it home.

[male] pointed at everything he saw, and I loved his enthusiasm.

Maximize Similarity: the document's sentences and its own images

Minimize Similarity: mismatched sentence and image sets (negative samples)
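The training slide pairs one "maximize" term with two "minimize" terms. A minimal sketch of a max-margin (hinge) loss with that shape follows. Everything specific here is an assumption for illustration: the document-level similarity is simplified to the mean of each sentence's best-matching image (a stand-in for an assignment-based score), the margin value is arbitrary, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def doc_similarity(sentence_vecs, image_vecs):
    """Simplified document-level similarity (an assumption, not the slide's
    definition): the mean over sentences of the best-matching image score."""
    best = [max(dot(s, i) for i in image_vecs) for s in sentence_vecs]
    return sum(best) / len(best)


def max_margin_loss(sents, imgs, neg_sents, neg_imgs, margin=0.2):
    """Hinge loss with one sampled negative per side.

    The true (sents, imgs) pair should score at least `margin` higher than
    the true sentences with sampled images, and than sampled sentences with
    the true images.
    """
    pos = doc_similarity(sents, imgs)
    neg_image_term = max(0.0, margin + doc_similarity(sents, neg_imgs) - pos)
    neg_sent_term = max(0.0, margin + doc_similarity(neg_sents, imgs) - pos)
    return neg_image_term + neg_sent_term


# Toy example: the true pair is aligned, the negatives are orthogonal.
sents, imgs = [[1, 0]], [[1, 0]]
neg_sents, neg_imgs = [[0, 1]], [[0, 1]]
satisfied = max_margin_loss(sents, imgs, neg_sents, neg_imgs, margin=0.2)
violated = max_margin_loss(sents, imgs, neg_sents, neg_imgs, margin=1.5)
print(satisfied, violated)
```

With a small margin the true pair already beats both negatives and the loss is zero. With a margin larger than the similarity gap, both hinge terms become active.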

Quantitative Results

We have labels for these datasets, but they are used only at test time, for evaluation.

(Higher = better)

[Figure: results chart; the highlighted entry is labeled "Ours"]

WIKI Prediction on 100-sentence Mauritius Article

First sighted by Europeans around 1600 on Mauritius, the dodo became extinct less than eighty years later. (84.5)

This archipelago was formed in a series of undersea volcanic eruptions 8-10 million years ago... (93.9)

The island is well known for its natural beauty. (92.1)

Mauritian Créole, which is spoken by 90 per cent of the population, is considered to be the native tongue... (68.3)

... a significant migrant population of Bhumihar Brahmins in Mauritius who have made a mark for themselves in different fields. (79.8)

For the dodo, an object detection baseline's selected sentence began with:

“(Mauritian Creole people usually known as ‘Creoles’)”


Noisy ASR for Video Captioning

[Zhou et al. 2018]

[Figure labels: CNN (video); Embeddings (ASR: "..today we will first mix..."); enc, dec; output caption: "Cook the tomatoes in the pan"]

Our hypothesis: this will help

[Figure: results for Prev. SoTA (video only), Noisy ASR (text only), and Video+ASR (multimodal)]


Many datasets/algorithms focus only on literal objects/actions...

"Do not describe what a person might say."

--- MSCOCO caption annotation guideline for Mechanical Turkers [Lin et al. 2014]

Image-text relationships on the web

Q: "How does an illustration relate to the text with which it is associated, or, what are the functions of illustration?"

A: It depends!

[Marsh and Domas White, 2003]

The Promise and the Perils

What concepts are "groundable"?

[NAACL 2018, Hessel, Mimno, Lee]

"... beautiful ..." "... dogs ..."

Does my model learn cross-modal interactions?

[In Sub to EMNLP 2020: Hessel, Lee; WWW 2017, Hessel, Lee, Mimno]

The grass is always greener


"Performance advantages of [multi-modal approaches] over language-only models have been clearly established when models are required to learn concrete noun concepts."

[Hill and Korhonen 2014]


The cat is in the grass.

This cat is enjoying the sun.

This is a beautiful baby.

The sunset is beautiful.

[Figure: images pass through a Conv Net into an Image Feature Space; the slide marks the images associated with "Beautiful" and with "Cat"]


Connection to Geospatial Statistics

[Anselin 1995]

[Jacquez and Greiling 2003]
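The feature-space picture and the pointer to geospatial statistics suggest scoring a word by how tightly its images cluster. The sketch below is one way to make that concrete and is an assumption rather than the formula used in the talk: for each image tagged with a word, take the fraction of its k nearest neighbors that carry the same tag, average over tagged images, and divide by the fraction expected if tags were placed at random. The toy 2-d points, the Euclidean distance, and the function names are all hypothetical.

```python
import math


def nearest_neighbors(index, points, k):
    """Indices of the k points closest to points[index] (Euclidean distance)."""
    dists = sorted(
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    )
    return [j for _, j in dists[:k]]


def clustering_score(points, tagged, k):
    """Observed same-tag neighbor rate divided by the rate expected at random.

    Values well above 1 mean the tagged images sit close together in
    feature space; values near or below 1 mean they are scattered.
    """
    tagged = set(tagged)
    n, n_w = len(points), len(tagged)
    observed = sum(
        sum(1 for j in nearest_neighbors(i, points, k) if j in tagged) / k
        for i in tagged
    ) / n_w
    expected = (n_w - 1) / (n - 1)  # chance that a random neighbor shares the tag
    return observed / expected


# Toy feature space: three tightly grouped points and three spread-out ones.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (9, 1), (1, 9)]
cat_score = clustering_score(points, tagged={0, 1, 2}, k=2)        # clustered tag
beautiful_score = clustering_score(points, tagged={0, 3, 4}, k=2)  # scattered tag
print(cat_score, beautiful_score)
```

In this toy setup the clustered tag scores above 1 and the scattered tag scores below 1, which mirrors the intuition that a word like "cat" picks out a visually coherent region while "beautiful" does not.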

COCO Results

Most concrete: wok 315.595, hummingbird 291.804, vane 290.037, racer 269.043, grizzly 229.274, equestrian 219.894, taxiing 205.410, unripe 201.733, siamese 199.024, delta 195.618, kiteboarding 192.459, airways 183.971, compartments 182.015, burners 180.553, stocked 177.472, spire 177.396, tulips 173.850, ben 171.936

Somewhat concrete: motorcycle 10.291, fun 10.267, including 10.262, lays 10.232, fish 10.184, goes 10.161, blurry 10.147, helmet 10.137, itself 10.128, umbrellas 10.108, teddy 10.060, bar 10.055, fancy 10.053, sticks 10.050, himself 10.038, take 10.016, steps 10.014, attempting 9.986

Not concrete: side 1.770, while 1.752, other 1.745, sits 1.741, for 1.730, behind 1.709, his 1.638, as 1.637, image 1.620, holding 1.619, this 1.602, picture 1.589, couple 1.585, from 1.569, large 1.568, person 1.561, looking 1.502, out 1.494

More concrete = easier to learn

Bad news: success of retrieval objective largely determined by original feature geometry

Context matters!

"London": Top 1% Concrete as a caption descriptor in MSCOCO.

"#London": Rank 1110/7K Concreteness as a hashtag in a Flickr image tagging dataset.

Open question: what are the limits of retrieval-style algorithms at scale?

Experiments on Wikipedia with LDA topics:

Most Concrete: hockey, tennis, nintendo, guns, baseball, wrestling1, wrestling2, software, auto racing, currency

Scores as listed: 170.2, 148.9, 86.3, 81.9, 80.9, 76.7, 71.4, 70.4, 60.9, 58.8

Least Concrete: australia, mexico, police, law, male names, community, history, time, months, linguistics

Scores as listed: 1.95, 1.81, 1.73, 1.71, 1.65, 1.58, 1.52, 1.47, 1.43, 1.29

Use Case of Our Algorithm from Shi et al. 2019 (ACL Best Paper Nom.)

Idea: unsupervised constituency parsing based on the concreteness of spans in image captions

(many more baselines in their paper)

Page 78: The Promise and the Perils - jmhessel.com · The Promise and the Perils of Learning Grounding from Visual8Textual Web data Jack Hessel Cornell University. What is visual8textual grounding?

The Promise and the Perils

What concepts are "groundable"?

[NAACL 2018, Hessel, Mimno, Lee]

"... beautiful ...""... dogs ..."

Does my model learn cross-modal interactions?

[In Sub to EMNLP 2020: Hessel, Lee;WWW 2017, Hessel, Lee, Mimno]

The grassis

alwaysgreener



Image-text relationships on the web

Q: "How does an illustration relate to the text with which it is associated, or, what are the functions of illustration?"

A: It depends!

[Marsh and Domas White, 2003]


Increasing number of multimodal, in-vivo studies


"Tonight, I carved a pumpkin. I also doused it in lighter fluid and lit it on fire." - /r/pics

"Snacks!" - /r/aww

"You have to go to the border for food Fish Tacos [San Diego]" - /r/FoodPorn

"Glamor Leaves" - /r/RedditLaqueristas

Our task: popularity ranking


"The grass is always greener"


Visual-textual interactions: "meaning multiplication"

The idea is that, under the right conditions, the value of a combination of different modes of meaning can be worth more than the information (whatever that might be) that we get from the modes when used alone.

In other words, text "multiplied by" images is more than text simply occurring with or alongside images.

--- Bateman, 2014, describing "Meaning Multiplication"

[Barthes 1988; Jones 1979]


Prediction Results

Multimodal beats unimodal!

Best unimodal (image only)

[Ding et al. 2019's Instagram results]


Highest Scores

Lowest Scores



In other words, text "multiplied by" images is more than text simply occurring with or alongside images.

--- Bateman, 2014, describing "Meaning Multiplication"

[Barthes 1988; Jones 1979]

What is visual-textual grounding?

A collection of tasks requiring connection between visual and textual content.


It can be difficult to tell what models learn...

[LXMERT: Tan and Bansal, 2019]


Can we formalize this a bit?

[Friedman 2001; Friedman et al. 2008; Hooker 2004]

[Figure: the text "The grass is always greener" shown under two panels, "Multimodally additive model" and "Multimodally interactive model"]
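
One way to write the distinction down, in the spirit of the additive-model literature cited on this slide. The symbols f, f_T, f_V, t and v are notation assumed here for illustration; they do not appear on the slide.

```latex
% Assumed notation: f maps a (text, image) pair (t, v) to a model output.
f \text{ is multimodally additive} \iff
  f(t, v) = f_T(t) + f_V(v)
  \quad \text{for some text-only } f_T \text{ and image-only } f_V .
% A model that cannot be written in this form is multimodally interactive:
% its output depends on how a particular text and a particular image combine.
```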


Simplifying models with function projection

Multimodally-additive models

Linear model

Ensemble of visual+textual

Kernel SVM

LXMERT

Neural Net

Empirical Multimodally Additive Projection

(We prove: uniqueness + optimality)


Well-balanced VQA datasets don't have this property

Accuracy results on dev set for LXMERT, projected LXMERT, and constant prediction


Takeaway: report the multimodally-additive projection performance!

[Figure: the text "The grass is always greener" shown under two panels, "Multimodally additive model" and "Multimodally interactive model"]


The Promise and the Perils

What concepts are "groundable"?

[NAACL 2018: Hessel, Mimno, Lee]

"... beautiful ..." "... dogs ..."

Does my model learn cross-modal interactions?

[In submission to EMNLP 2020: Hessel, Lee; WWW 2017: Hessel, Lee, Mimno]

"The grass is always greener"


The Promise and the Perils

We can do cool things with multimodal web data, but web texts are not literal image descriptions

(even though most algorithms treat them that way)


My Research Goals:

build better grounding algorithms

understand web communication

[Diagram: the two goals are linked by arrows, each labeled "requires"]


Thanks to my awesome collaborators!

David Mimno, Lillian Lee, Bo Pang, Radu Soricut, Zhenhai Zhu


And thanks to you!!

Contact: [email protected], @jmhessel on Twitter

Code, data, and papers are all available: http://www.cs.cornell.edu/~jhessel/

"... beautiful ..." "... dogs ..."

The Promise and the Perils


Our contributions:

- Fast algorithm for computing concreteness
- Extension from unigrams/bigrams to LDA topics
- Demonstration that concreteness is context specific

Work on identifying hard/easy-to-ground concepts:

[Lu et al., 2008; Berg et al., 2010; Parikh and Grauman, 2011; Yatskar et al., 2013; Young et al., 2014; Kiela and Bottou, 2014; Jas and Parikh, 2015; Lazaridou et al., 2015; Silberer et al., 2016; Lu et al., 2017; Bhaskar et al., 2017; Mahajan et al., 2018; inter alia]


The empirical projection

(We prove: uniqueness + optimality)

Compute output for all image/text pairs, even mismatched ones not appearing in the data.

Return predictions with only additive structure that are minimally distant (according to squared error) from original predictions.
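
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the talk's released code. It assumes a scalar model output and uses the standard least-squares additive fit to a full grid: row mean plus column mean minus grand mean. `toy_model`, `text_feats` and `image_feats` are hypothetical stand-ins for a real model and dataset.

```python
import numpy as np

def additive_projection(scores):
    """Closest additive approximation (in squared error) to a grid of outputs.

    scores[i, j] is the model output for text i paired with image j,
    including mismatched pairs (i != j) that never occur in the data.
    """
    scores = np.asarray(scores, dtype=float)
    text_effect = scores.mean(axis=1, keepdims=True)   # average over all images
    image_effect = scores.mean(axis=0, keepdims=True)  # average over all texts
    grand_mean = scores.mean()
    return text_effect + image_effect - grand_mean

def toy_model(text_feat, image_feat):
    # Hypothetical scorer: an additive part plus a text-image interaction term.
    return 2.0 * text_feat + 3.0 * image_feat + text_feat * image_feat

text_feats = np.array([0.0, 1.0, 2.0])     # hypothetical 1-d text features
image_feats = np.array([1.0, -1.0, 0.5])   # hypothetical 1-d image features

# Step 1: evaluate the model on every text/image combination.
all_pairs = toy_model(text_feats[:, None], image_feats[None, :])

# Step 2: keep only the additive structure.
projected = additive_projection(all_pairs)

# Predictions for the pairs that actually appear in the data (the diagonal).
matched_original = np.diag(all_pairs)
matched_projected = np.diag(projected)
```

Comparing a metric computed on `matched_original` with the same metric on `matched_projected` gives the kind of "model vs. projected model" comparison the talk recommends reporting. Evaluating all pairs costs a quadratic number of forward passes, so in practice this would likely be run on an evaluation subset.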