weakly-supervised learning from videos and...

Page 1: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA

Ivan Laptev

WILLOW, INRIA/ENS/CNRS, Paris

Weakly-supervised learning from videos and scripts

ERC ALLEGRO workshop

INRIA Grenoble

July 23, 2014

Joint work with: Piotr Bojanowski, Rémi Lajugie, Francis Bach, Jean Ponce, Cordelia Schmid, Josef Sivic

Where to get training data?

• Shoot actions in the lab
  KTH dataset, Weizmann dataset, …
  - Limited variability
  - Unrealistic

• Manually annotate existing content
  HMDB, Olympic Sports, UCF50, UCF101, …
  - Very time-consuming

• Use readily-available video scripts
  www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
  - Scripts are available for 1000’s of hours of movies and TV series
  - Scripts describe dynamic and static content of videos

As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...

Scripts as weak supervision

[Figure: label "Uncertainty"; timestamps 24:25 and 24:51]

Challenges:

• Imprecise temporal localization
• No explicit spatial localization
• NLP problems, scripts ≠ training labels
  “… Will gets out of the Chevrolet. …”
  “… Erin exits her new truck…”
  vs. Get-out-car

Previous work

Sivic, Everingham, and Zisserman, "'Who are you?' -- Learning Person Specific Classifiers from Video", In CVPR 2009.

Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", In CVPR 2009.

Duchenne, Laptev, Sivic, Bach and Ponce, "Automatic Annotation of Human Actions in Video", In ICCV 2009.

…wanted to know about the history of the trees

Joint Learning of Actors and Actions [Bojanowski et al. ICCV 2013]

Rick walks up behind Ilsa

[Figure labels: Rick?, Walks?]

Joint Learning of Actors and Actions [Bojanowski et al. ICCV 2013]

Rick walks up behind Ilsa

[Figure labels: Rick, Walks]

Formulation: Cost function

[Equation not recovered. Annotated terms: actor labels (Rick, Ilsa, Sam), actor image features, actor classifier]

Formulation: Cost function

Weak supervision from scripts:

Person p appears at least once in clip N (example: p = Rick)

Formulation: Cost function

Weak supervision from scripts:

Action a appears at least once in clip N (example: a = Walk)

Formulation: Cost function

Weak supervision from scripts:

• Action a appears in clip N
• Person p appears in clip N
• Person p and Action a appear in clip N
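
The three kinds of script constraints above can be read as "at least one" conditions on binary label assignments. The sketch below is only an illustration under stated assumptions, not the formulation from the slides (whose equations were not recovered): the matrices `Z` (tracks × persons) and `T` (tracks × actions), the function names, and the toy data are all hypothetical.

```python
import numpy as np

def person_ok(Z, clip_tracks, p):
    """True if person p is assigned to at least one track of the clip."""
    return bool(Z[clip_tracks, p].sum() >= 1)

def action_ok(T, clip_tracks, a):
    """True if action a is assigned to at least one track of the clip."""
    return bool(T[clip_tracks, a].sum() >= 1)

def person_action_ok(Z, T, clip_tracks, p, a):
    """True if a single track of the clip carries both person p and action a."""
    return bool((Z[clip_tracks, p] * T[clip_tracks, a]).sum() >= 1)

# Toy data (hypothetical): 3 person tracks.
# Person columns: 0 = Rick, 1 = Ilsa, 2 = Sam. Action columns: 0 = walk, 1 = sit.
Z = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
T = np.array([[1, 0],
              [0, 1],
              [0, 1]])
clip = [0, 1]  # indices of the tracks that fall inside one script-aligned clip

print(person_ok(Z, clip, 0))               # is Rick in the clip?
print(person_action_ok(Z, T, clip, 0, 0))  # is Rick walking in the clip?
```

The joint constraint is stricter than the two separate ones: it requires the person and the action to land on the same track, which is what ties names to actions.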

Image and video features

Face features
• Facial features [Everingham’06]
• HOG descriptor on normalized face image

Action features
• Dense Trajectory features in person bounding box [Wang et al.,’11]

Results for Person Labelling

American Beauty (11 character names)

Casablanca (17 character names)

Results for Person + Action Labelling

Casablanca, Walking

Finding Actions and Actors in Movies

[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]

Action Learning with Ordering Constraints [Bojanowski et al. ECCV 2014]

Cost Function

Weak supervision from ordering constraints on Z:

[Figure: axes labelled "Action label", "Action index" and "Video time intervals"; values shown: 2, 4, 1, 2, 3, 2]
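
One way to read "ordering constraints on Z" is that each video time interval is assigned to one entry of the annotated action sequence, and that the assigned entry can only stay the same or advance by one as time moves forward. The sketch below encodes that reading. It is an assumption-laden illustration, not the definition used on the slide: it assumes every annotated action covers at least one interval and that there is no background class, and the function names are hypothetical.

```python
import numpy as np

def is_valid_path(index_per_interval, n_annotations):
    """Check that interval -> annotation-index assignments respect the order.

    The path must start at the first annotation, end at the last one,
    and move by 0 or +1 between consecutive time intervals.
    """
    idx = np.asarray(index_per_interval)
    steps = np.diff(idx)
    return bool(idx[0] == 0
                and idx[-1] == n_annotations - 1
                and np.all((steps == 0) | (steps == 1)))

def assignment_matrix(index_per_interval, labels, n_classes):
    """Build Z (intervals x classes) from a path and the annotated label sequence."""
    Z = np.zeros((len(index_per_interval), n_classes), dtype=int)
    for t, k in enumerate(index_per_interval):
        Z[t, labels[k]] = 1  # interval t takes the class of annotation k
    return Z

# Toy example (hypothetical): 5 intervals, annotated sequence of class ids 2, 0, 2.
labels = [2, 0, 2]
path = [0, 0, 1, 2, 2]
print(is_valid_path(path, len(labels)))
print(assignment_matrix(path, labels, n_classes=4))
```

Note that the same class may appear several times in the annotation (class 2 above), which is why the path indexes positions in the sequence rather than classes.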

Is the optimization tractable?

• Path constraints are implicit

• Cannot use off-the-shelf solvers

• Frank-Wolfe optimization algorithm
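
To make the role of Frank-Wolfe concrete: the algorithm never needs an explicit list of constraints, only a routine that minimises a linear function over the feasible set. The sketch below is a toy illustration under several assumptions that are not taken from the slides: the objective is a simple quadratic `0.5 * ||Y - C||^2` standing in for the real cost, the feasible set is the convex hull of monotone paths (each interval assigned to one annotation index, moving by 0 or +1), the linear minimisation is done by dynamic programming, and the step size uses exact line search. All names are hypothetical.

```python
import numpy as np

def best_path(G):
    """Linear minimisation oracle: monotone path Y minimising sum(G * Y).

    G has shape (intervals, annotations). Solved by dynamic programming.
    """
    T, K = G.shape
    assert T >= K, "need at least one interval per annotated action"
    cost = np.full((T, K), np.inf)
    cost[0, 0] = G[0, 0]  # the path must start at the first annotation
    for t in range(1, T):
        for k in range(K):
            move = cost[t - 1, k - 1] if k > 0 else np.inf
            cost[t, k] = G[t, k] + min(cost[t - 1, k], move)
    # Backtrack from the last annotation at the last interval.
    Y = np.zeros((T, K))
    k = K - 1
    for t in range(T - 1, -1, -1):
        Y[t, k] = 1.0
        if t > 0 and k > 0 and cost[t - 1, k - 1] <= cost[t - 1, k]:
            k -= 1
    return Y

def frank_wolfe(C, n_iters=100, tol=1e-9):
    """Minimise 0.5 * ||Y - C||^2 over the convex hull of monotone paths."""
    Y = best_path(np.zeros_like(C, dtype=float))  # any feasible starting point
    for _ in range(n_iters):
        grad = Y - C                  # gradient of the toy quadratic
        S = best_path(grad)           # best vertex for the linearised cost
        gap = np.sum(grad * (Y - S))  # Frank-Wolfe duality gap
        if gap <= tol:
            break
        gamma = min(1.0, gap / np.sum((Y - S) ** 2))  # exact line search
        Y = Y + gamma * (S - Y)
    return Y

# Toy target (hypothetical scores): 6 intervals, 3 annotated actions.
C = np.random.default_rng(0).random((6, 3))
Y = frank_wolfe(C)
print(Y.round(2))
print(Y.argmax(axis=1))  # one simple way to read off a hard assignment
```

Each iterate is a convex combination of valid paths, so the relaxed solution generally has fractional entries; the `argmax` at the end is just one simple rounding choice and may not itself be a valid path.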

Results

• 937 video clips from 60 Hollywood movies
• 16 action classes
• Each clip is annotated by a sequence of n actions (2≤n≤11)

Summary

Action learning with ordering constraints
• Weakly-supervised learning using time ordering constraints.
• Reason about action sequences.

Joint Learning of Actors and Actions
• Weakly-supervised learning of actions and names.
• Reason about individual people.

Limitations / Future work

Action learning with ordering constraints
• No spatial localization. Want to answer questions:
  - Who is doing what?
  - Who interacts with whom?
• Actions are modeled at short time intervals (15 frames).
• Sequences of action labels are given manually. Want to jointly cluster videos and scripts.

Joint Learning of Actors and Actions
• No temporal localization of actions within person tracks.
• Finding people in movies is still a big challenge.
• Extracting action labels from scripts is a major (NLP+vision?) challenge.