Self-supervised Learning for Visual Recognition


Page 1: Self-supervised Learning for Visual Recognition

Self-supervised Learning for Visual Recognition

Hamed Pirsiavash

University of Maryland, Baltimore County

1

Page 2: Self-supervised Learning for Visual Recognition

Significant progress in recognition is due to large annotated datasets: 14 million images, 10 million images, 450 hours of video, 1.7 million question/answer pairs.

Page 3: Self-supervised Learning for Visual Recognition

Self-supervised learning

3

Zhang et al. ECCV’16

Input Output

Page 4: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 0, Dog: 1, Car: 0, …

4

Page 5: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 0, Dog: 1, Car: 0, …

5

Page 6: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 1, Dog: 0, Car: 0, …

6

Page 7: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 1, Dog: 0, Car: 0, …

7

Transfer to another task

Page 8: Self-supervised Learning for Visual Recognition

Supervised Learning (counting)

Input image → Label: Chair: 0, Dog: 2, Car: 0, …

8

Page 9: Self-supervised Learning for Visual Recognition

9

Inference on counting network

Page 10: Self-supervised Learning for Visual Recognition

10

Constraint in the output

Page 11: Self-supervised Learning for Visual Recognition

11

Constraint in the output

Page 12: Self-supervised Learning for Visual Recognition

12

Constraint in the output

Page 13: Self-supervised Learning for Visual Recognition

13

Two constraints in learning

Annotation...

Page 14: Self-supervised Learning for Visual Recognition

14

Two constraints in learning

Annotation...

Page 15: Self-supervised Learning for Visual Recognition

Self-supervised learning

[Figure: the counting pretext task. A feature network φ maps an image to a vector of counts. It is applied to the downsampled image D ◦ x, to the four tiles T1 ◦ x, …, T4 ◦ x of x, and to the downsampled version D ◦ y of a different image y. With t = φ(T1 ◦ x) + φ(T2 ◦ x) + φ(T3 ◦ x) + φ(T4 ◦ x), d = φ(D ◦ x), and c = φ(D ◦ y), training minimizes

|d − t|² + max{0, M − |c − t|²},

so the tile counts must sum to the count of the whole image while differing by at least a margin M from the count of an unrelated image.]

15
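A minimal PyTorch sketch of this loss, assuming a feature network phi that maps an image batch to count vectors; the margin M, the 2x downsampling, and the 2x2 tiling below are illustrative choices, not the exact training code:

```python
import torch
import torch.nn.functional as F

def counting_loss(phi, x, y, M=10.0):
    """phi: network mapping (B, 3, h, w) images to (B, K) count vectors.
    x, y: two batches of different images, shape (B, 3, H, W), H and W even."""
    _, _, H, W = x.shape
    # D o x and D o y: downsample both images to the tile resolution.
    d_x = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    d_y = F.interpolate(y, scale_factor=0.5, mode="bilinear", align_corners=False)
    # T1 o x ... T4 o x: the four non-overlapping tiles of x.
    tiles = [x[:, :, :H // 2, :W // 2], x[:, :, :H // 2, W // 2:],
             x[:, :, H // 2:, :W // 2], x[:, :, H // 2:, W // 2:]]
    t = sum(phi(tile) for tile in tiles)  # t: sum of the tile counts
    d = phi(d_x)                          # d: counts of the downsampled x
    c = phi(d_y)                          # c: counts of a different image y
    match = ((d - t) ** 2).sum(dim=1)                 # |d - t|^2
    contrast = F.relu(M - ((c - t) ** 2).sum(dim=1))  # max{0, M - |c - t|^2}
    return (match + contrast).mean()
```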


Page 22: Self-supervised Learning for Visual Recognition

Trained on ImageNet without annotation

22

Unit 1

Unit 2

Unit 3

Images with largest activation

Page 23: Self-supervised Learning for Visual Recognition

Trained on COCO without annotation

23

Unit 1

Unit 2

Unit 3

Images with largest activation

Page 24: Self-supervised Learning for Visual Recognition

Trained on ImageNet without annotation

24

query retrieved

Nearest neighbor search

Page 25: Self-supervised Learning for Visual Recognition

Trained on COCO without annotation

25

query retrieved

Nearest neighbor search
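Retrieval like this amounts to an L2 nearest-neighbor search in feature space; a small NumPy sketch with stand-in features (the array shapes and sizes are illustrative):

```python
import numpy as np

def nearest_neighbors(query_feat, gallery_feats, k=5):
    """Indices of the k gallery features closest (in L2 distance) to the query."""
    dists = np.linalg.norm(gallery_feats - query_feat[None, :], axis=1)
    return np.argsort(dists)[:k]

# Stand-in features; in practice these come from the self-supervised network.
gallery = np.random.randn(1000, 256).astype(np.float32)
query = gallery[0]
print(nearest_neighbors(query, gallery, k=5))
```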

Page 26: Self-supervised Learning for Visual Recognition

26

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Dataset (no labels)

Page 27: Self-supervised Learning for Visual Recognition

27

Fine-tuning

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)
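The two-stage pipeline above might be sketched in PyTorch as follows; the pretext and target heads, the class counts, and the use of torchvision's AlexNet trunk are illustrative placeholders rather than the actual training code:

```python
import torch
import torch.nn as nn
import torchvision

# Shared feature network: the AlexNet trunk, without its classifier head.
features = torchvision.models.alexnet(weights=None).features
pool = nn.AdaptiveAvgPool2d((6, 6))

# Stage 1: train on the pretext task with an unlabeled dataset.
pretext_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 6 * 6, 1000))
pretext_model = nn.Sequential(features, pool, pretext_head)
# ... optimize pretext_model on (image, pretext_target) pairs, no human labels ...

# Stage 2: reuse the same features, attach a head for the target task,
# and fine-tune on the labeled dataset.
target_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 6 * 6, 21))
target_model = nn.Sequential(features, pool, target_head)
optimizer = torch.optim.SGD(target_model.parameters(), lr=1e-3, momentum=0.9)
# ... fine-tune target_model on (image, label) pairs from the target task ...
```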

Page 28: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Fine-tuning on PASCAL VOC07

28

Results on transfer learning

Page 29: Self-supervised Learning for Visual Recognition

Doersch et al. ICCV'15
Noroozi and Favaro ECCV'16
Zhang et al. ECCV'16
Pathak et al. CVPR'16
Wang and Gupta ICCV'15
Pathak et al. CVPR'17
Jayaraman and Grauman ICCV'15
Agrawal et al. ICCV'15
Owens et al. ECCV'16
Misra et al. ECCV'16

29


Page 35: Self-supervised Learning for Visual Recognition

Agenda

• Self-supervised learning by counting

• Boosting self-supervised learning by knowledge transfer

35

Page 36: Self-supervised Learning for Visual Recognition

36

Fine-tuning

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

Page 37: Self-supervised Learning for Visual Recognition

37

Fine-tuning

Feature network (e.g., AlexNet)

Target task (e.g., object detection)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated pretext task

Larger dataset (no labels)

Page 38: Self-supervised Learning for Visual Recognition

38

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Larger dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated pretext task

Page 39: Self-supervised Learning for Visual Recognition

39

Transferring

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Larger dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated pretext task


Page 41: Self-supervised Learning for Visual Recognition

41

More complicated feature network (e.g., VGG)

More complicated pretext task

Target task (e.g., object detection)

Larger dataset (no labels)

Dataset (with labels)

Page 42: Self-supervised Learning for Visual Recognition

42

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

More complicated pretext task

Dataset (with labels)

Page 43: Self-supervised Learning for Visual Recognition

43

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Pseudo labels

More complicated pretext task

Page 44: Self-supervised Learning for Visual Recognition

44

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

More complicated pretext task

Page 45: Self-supervised Learning for Visual Recognition

45

Jigsaw


Permute and then predict the permutation

Noroozi, Mehdi, and Paolo Favaro. "Unsupervised learning of visual representations by solving jigsaw puzzles." ECCV 2016.
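One way such a training sample could be built, sketched with NumPy: split a square crop into a 3×3 grid of tiles, shuffle them with a permutation drawn from a fixed set, and use the permutation's index as the classification target (the permutation set and image size are illustrative):

```python
import numpy as np

def make_jigsaw_sample(img, permutations, rng=np.random):
    """img: (H, W, 3) array with H and W divisible by 3.
    permutations: (P, 9) array of tile orderings; the row index is the label."""
    H, W, _ = img.shape
    th, tw = H // 3, W // 3
    tiles = [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(3) for c in range(3)]
    label = rng.randint(len(permutations))
    shuffled = [tiles[i] for i in permutations[label]]
    return np.stack(shuffled), label  # (9, th, tw, 3) tiles + permutation index

# Toy usage with a small random permutation set.
perms = np.array([np.random.permutation(9) for _ in range(100)])
tiles, label = make_jigsaw_sample(np.zeros((225, 225, 3), np.float32), perms)
```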

Page 46: Self-supervised Learning for Visual Recognition

46

Jigsaw++


(a) Self-Supervised Learning Pre-Training. Suppose that we are given a pretext task, a model and a dataset. Our first step in SSL is to train our model on the pretext task with the given dataset (see Fig. 2 (a)). Typically, the models of choice are convolutional neural networks, and one considers as feature the output of some intermediate layer (shown as a grey rectangle in Fig. 2 (a)).

(b) Clustering. Our next step is to compute feature vectors for all the images in our dataset. Then, we use the k-means algorithm with the Euclidean distance to cluster the features (see Fig. 2 (b)). Ideally, when performing this clustering on ImageNet images, we want the cluster centers to be aligned with object categories. In the experiments, we typically use 2,000 clusters.

(c) Extracting Pseudo-Labels. The cluster centers computed in the previous section can be considered as virtual categories. Indeed, we can assign feature vectors to the closest cluster center to determine a pseudo-label associated to the chosen cluster. This operation is illustrated in Fig. 2 (c). Notice that the dataset used in this operation might be different from that used in the clustering step or in the SSL pre-training.

(d) Cluster Classification. Finally, we train a simple classifier using the architecture of the target task so that, given an input image (from the dataset used to extract the pseudo-labels), it predicts the corresponding pseudo-label (see Fig. 2 (d)). This classifier learns a new representation in the target architecture that maps images that were originally close to each other in the pre-trained feature space to close points.

4. The Jigsaw++ Pretext Task

Recent work [7, 31] has shown that deeper architectures can help in SSL with PASCAL recognition tasks (e.g., ResNet). However, those methods use the same deep architecture for both SSL and fine-tuning. Hence, they are not comparable with previous methods that use a simpler AlexNet architecture in fine-tuning. We are interested in knowing how far one can improve the SSL pre-training of AlexNet for PASCAL tasks. Since in our framework the SSL task is not restricted to use the same architecture as in the final supervised task, we can increase the difficulty of the SSL task along with the capacity of the architecture and still use AlexNet at the fine-tuning stage.

To this aim, we extend the jigsaw [20] task and call it the jigsaw++ task. The original pretext task [20] is to find a reordering of tiles from a 3×3 grid of a square region cropped from an image. In jigsaw++, we replace a random number of tiles in the grid (up to 2) with (occluding) tiles from another random image (see Fig. 3). The number of tiles (0, 1 or 2 in our experiments) as well as their location are randomly selected. The occluding tiles make the task remarkably more complex. First, the model needs to detect the occluding tiles and second, it needs to solve the jigsaw problem by using only the remaining patches. To make sure we are not adding ambiguities to the task, we remove similar permutations so that the minimum Hamming distance between any two permutations is at least 3. In this way, there is a unique solution to the jigsaw task for any number of occlusions in our training setting. Our final training permutation set includes 701 permutations, in which the average and minimum Hamming distance is 0.86 and 3 respectively. In addition to the mean and std normalization of each patch independently, as it was done in the original paper, we train the network 70% of the time on the grayscale images. In this way, we prevent the network from using low-level statistics to detect occlusions and solve the jigsaw task.

Figure 3: The jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation of [20], where all tiles come from the same image. (d) a puzzle in the jigsaw++ task, where at most 2 tiles can come from a random image.

We train the jigsaw++ task on both VGG16 and AlexNet architectures. By having a larger capacity with VGG16, the network is better equipped to handle the increased complexity of the jigsaw++ task and is capable of extracting better representations from the data.

Following our pipeline in Fig. 2, we train our models …
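Steps (b) and (c) above, clustering the pre-trained features with k-means and reading off pseudo-labels, can be sketched with scikit-learn; the feature matrix below is a random stand-in and the cluster count is reduced so the sketch runs quickly (the text uses 2,000 clusters):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for features of the unlabeled images, taken from an intermediate
# layer of the pre-trained SSL network (e.g., VGG trained on jigsaw++).
feats = np.random.randn(5000, 256).astype(np.float32)

# (b) Clustering: k-means with Euclidean distance (2,000 clusters in the text;
# fewer here so the sketch runs quickly).
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(feats)

# (c) Pseudo-labels: each image gets the id of its nearest cluster center.
pseudo_labels = kmeans.predict(feats)

# (d) Cluster classification: train the target architecture (e.g., AlexNet)
# to predict pseudo_labels from the corresponding images, then fine-tune it
# on the labeled target task.
print(np.bincount(pseudo_labels)[:10])
```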


• Add distracting patches

• Increase number of permutations
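The excerpt above keeps only permutations whose pairwise Hamming distance is at least 3; a greedy rejection-sampling sketch of building such a set (illustrative, not the paper's generation script):

```python
import numpy as np

def hamming(p, q):
    """Number of tile positions where two orderings differ."""
    return int(np.sum(p != q))

def build_permutation_set(n_perms, min_dist=3, seed=0):
    """Greedily collect permutations of 9 tiles whose pairwise Hamming
    distance is at least min_dist."""
    rng = np.random.RandomState(seed)
    selected = [np.arange(9)]  # start from the identity ordering
    while len(selected) < n_perms:
        cand = rng.permutation(9)
        if all(hamming(cand, p) >= min_dist for p in selected):
            selected.append(cand)
    return np.stack(selected)

perms = build_permutation_set(n_perms=50)  # the paper's set has 701 entries
print(perms.shape)
```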

Page 47: Self-supervised Learning for Visual Recognition

47

Clusters on Jigsaw++

Page 48: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Fine-tuning on PASCAL VOC07

50

Results on transfer learning

Page 49: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Jigsaw++ (Ours) 72.5 56.5 42.6

Fine-tuning on PASCAL VOC07

51

Results on transfer learning

Page 50: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Jigsaw++ (Ours) 72.5 56.5 42.6

RotNet (ICLR’18) 72.9 54.4 39.1

Deep clustering (ECCV’18) 73.7 55.4 45.1

Fine-tuning on PASCAL VOC07

52

Results on transfer learning

Page 51: Self-supervised Learning for Visual Recognition

53

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

More complicated pretext task

Page 52: Self-supervised Learning for Visual Recognition

54

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

HOG
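The same pseudo-label pipeline can be driven by hand-crafted HOG descriptors instead of a learned SSL network; a sketch with scikit-image and scikit-learn, with illustrative parameters:

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans

def hog_features(images):
    """images: iterable of (H, W) grayscale arrays -> (N, D) HOG descriptors."""
    return np.stack([hog(im, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for im in images])

# Stand-in images; in practice these are the unlabeled training images.
images = [np.random.rand(64, 64) for _ in range(500)]
feats = hog_features(images)

# Cluster the HOG descriptors and use the assignments as pseudo-labels,
# exactly as with the learned SSL features.
pseudo_labels = KMeans(n_clusters=50, n_init=4, random_state=0).fit_predict(feats)
print(pseudo_labels[:10])
```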

Page 53: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (ours) 67.7 52.4 36.6

Jigsaw++ (ours) 72.5 56.5 42.6

HOG (ours) 70.2 53.2 39.2

Fine-tuning on PASCAL VOC07

55

Results on transfer learning

Kaiming He, Ross Girshick, Piotr Dollár, "Rethinking ImageNet Pre-training", arXiv, Nov 2018.

Page 54: Self-supervised Learning for Visual Recognition

Visualization of conv1 filters

56

From scratch

CC on VGG-Jigsaw++

CC on HOG

Page 55: Self-supervised Learning for Visual Recognition

57

Thanks to

Mehdi Noroozi, Paolo Favaro, Ananth Kavalkazhani

Page 56: Self-supervised Learning for Visual Recognition

58

Thanks!