Self-supervised Learning for Visual Recognition


Page 1: Self-supervised Learning for Visual Recognition

Self-supervised Learning for Visual Recognition

Hamed Pirsiavash

University of Maryland, Baltimore County

1

Page 2: Self-supervised Learning for Visual Recognition

Significant progress in recognition is due to large annotated datasets: 14 million images, 10 million images, 450 hours of video, 1.7 million question/answer pairs.

Page 3: Self-supervised Learning for Visual Recognition

Self-supervised learning

3

Zhang et al. ECCV’16

Input Output

Page 4: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 0, Dog: 1, Car: 0, …

4

Page 5: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 0, Dog: 1, Car: 0, …

5

Page 6: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 1, Dog: 0, Car: 0, …

6

Page 7: Self-supervised Learning for Visual Recognition

Supervised Learning (classification)

Input image → Label: Chair: 1, Dog: 0, Car: 0, …

7

Transfer to another task

Page 8: Self-supervised Learning for Visual Recognition

Supervised Learning (counting)

Input image → Label: Chair: 0, Dog: 2, Car: 0, …

8

Page 9: Self-supervised Learning for Visual Recognition

9

Inference on counting network

Page 10: Self-supervised Learning for Visual Recognition

10

Constraint in the output

Page 11: Self-supervised Learning for Visual Recognition

11

Constraint in the output

Page 12: Self-supervised Learning for Visual Recognition

12

Constraint in the output

Page 13: Self-supervised Learning for Visual Recognition

13

Two constraints in learning

Annotation...

Page 14: Self-supervised Learning for Visual Recognition

14

Two constraints in learning

Annotation...

Page 15: Self-supervised Learning for Visual Recognition

Self-supervised learning

[Figure: the counting pretext task. A feature network φ maps an image to a vector of counts. It is applied to the downsampled image D ◦ x, to the four tiles T1 ◦ x, …, T4 ◦ x of x, and to the downsampled version D ◦ y of a different image y. With t = φ(T1 ◦ x) + φ(T2 ◦ x) + φ(T3 ◦ x) + φ(T4 ◦ x), d = φ(D ◦ x), and c = φ(D ◦ y), training minimizes

|d − t|² + max{0, M − |c − t|²},

so the tile counts must sum to the count of the whole image while differing by at least a margin M from the count of an unrelated image.]

15
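A minimal PyTorch sketch of this loss, assuming a feature network phi that maps an image batch to count vectors; the margin M, the 2x downsampling, and the 2x2 tiling below are illustrative choices, not the exact training code:

```python
import torch
import torch.nn.functional as F

def counting_loss(phi, x, y, M=10.0):
    """phi: network mapping (B, 3, h, w) images to (B, K) count vectors.
    x, y: two batches of different images, shape (B, 3, H, W), H and W even."""
    _, _, H, W = x.shape
    # D o x and D o y: downsample both images to the tile resolution.
    d_x = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    d_y = F.interpolate(y, scale_factor=0.5, mode="bilinear", align_corners=False)
    # T1 o x ... T4 o x: the four non-overlapping tiles of x.
    tiles = [x[:, :, :H // 2, :W // 2], x[:, :, :H // 2, W // 2:],
             x[:, :, H // 2:, :W // 2], x[:, :, H // 2:, W // 2:]]
    t = sum(phi(tile) for tile in tiles)  # t: sum of the tile counts
    d = phi(d_x)                          # d: counts of the downsampled x
    c = phi(d_y)                          # c: counts of a different image y
    match = ((d - t) ** 2).sum(dim=1)                 # |d - t|^2
    contrast = F.relu(M - ((c - t) ** 2).sum(dim=1))  # max{0, M - |c - t|^2}
    return (match + contrast).mean()
```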


Page 22: Self-supervised Learning for Visual Recognition

Trained on ImageNet without annotation

22

Unit 1

Unit 2

Unit 3

Images with largest activation

Page 23: Self-supervised Learning for Visual Recognition

Trained on COCO without annotation

23

Unit 1

Unit 2

Unit 3

Images with largest activation

Page 24: Self-supervised Learning for Visual Recognition

Trained on ImageNet without annotation

24

query retrieved

Nearest neighbor search

Page 25: Self-supervised Learning for Visual Recognition

Trained on COCO without annotation

25

query retrieved

Nearest neighbor search
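Retrieval like this amounts to an L2 nearest-neighbor search in feature space; a small NumPy sketch with stand-in features (the array shapes and sizes are illustrative):

```python
import numpy as np

def nearest_neighbors(query_feat, gallery_feats, k=5):
    """Indices of the k gallery features closest (in L2 distance) to the query."""
    dists = np.linalg.norm(gallery_feats - query_feat[None, :], axis=1)
    return np.argsort(dists)[:k]

# Stand-in features; in practice these come from the self-supervised network.
gallery = np.random.randn(1000, 256).astype(np.float32)
query = gallery[0]
print(nearest_neighbors(query, gallery, k=5))
```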

Page 26: Self-supervised Learning for Visual Recognition

26

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Dataset (no labels)

Page 27: Self-supervised Learning for Visual Recognition

27

Fine-tuning

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)
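The two-stage pipeline above might be sketched in PyTorch as follows; the pretext and target heads, the class counts, and the use of torchvision's AlexNet trunk are illustrative placeholders rather than the actual training code:

```python
import torch
import torch.nn as nn
import torchvision

# Shared feature network: the AlexNet trunk, without its classifier head.
features = torchvision.models.alexnet(weights=None).features
pool = nn.AdaptiveAvgPool2d((6, 6))

# Stage 1: train on the pretext task with an unlabeled dataset.
pretext_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 6 * 6, 1000))
pretext_model = nn.Sequential(features, pool, pretext_head)
# ... optimize pretext_model on (image, pretext_target) pairs, no human labels ...

# Stage 2: reuse the same features, attach a head for the target task,
# and fine-tune on the labeled dataset.
target_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 6 * 6, 21))
target_model = nn.Sequential(features, pool, target_head)
optimizer = torch.optim.SGD(target_model.parameters(), lr=1e-3, momentum=0.9)
# ... fine-tune target_model on (image, label) pairs from the target task ...
```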

Page 28: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Fine-tuning on PASCAL VOC07

28

Results on transfer learning

Page 29: Self-supervised Learning for Visual Recognition

Doersch et al. ICCV'15
Noroozi and Favaro ECCV'16
Zhang et al. ECCV'16
Pathak et al. CVPR'16
Wang and Gupta ICCV'15
Pathak et al. CVPR'17
Jayaraman and Grauman ICCV'15
Agrawal et al. ICCV'15
Owens et al. ECCV'16
Misra et al. ECCV'16

29


Page 35: Self-supervised Learning for Visual Recognition

Agenda

• Self-supervised learning by counting

• Boosting self-supervised learning by knowledge transfer

35

Page 36: Self-supervised Learning for Visual Recognition

36

Fine-tuning

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

Page 37: Self-supervised Learning for Visual Recognition

37

Fine-tuning

Feature network (e.g., AlexNet)

Target task (e.g., object detection)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated pretext task

Larger dataset (no labels)

Page 38: Self-supervised Learning for Visual Recognition

38

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Larger dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated pretext task

Page 39: Self-supervised Learning for Visual Recognition

39

Transferring

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Larger dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated pretext task


Page 41: Self-supervised Learning for Visual Recognition

41

More complicated feature network (e.g., VGG)

More complicated pretext task

Target task (e.g., object detection)

Larger dataset (no labels)

Dataset (with labels)

Page 42: Self-supervised Learning for Visual Recognition

42

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

More complicated pretext task

Dataset (with labels)

Page 43: Self-supervised Learning for Visual Recognition

43

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Pseudo labels

More complicated pretext task

Page 44: Self-supervised Learning for Visual Recognition

44

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

More complicated pretext task

Page 45: Self-supervised Learning for Visual Recognition

45

Jigsaw


Permute and then predict the permutation

Noroozi, Mehdi, and Paolo Favaro. "Unsupervised learning of visual representations by solving jigsaw puzzles." ECCV 2016.
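One way such a training sample could be built, sketched with NumPy: split a square crop into a 3×3 grid of tiles, shuffle them with a permutation drawn from a fixed set, and use the permutation's index as the classification target (the permutation set and image size are illustrative):

```python
import numpy as np

def make_jigsaw_sample(img, permutations, rng=np.random):
    """img: (H, W, 3) array with H and W divisible by 3.
    permutations: (P, 9) array of tile orderings; the row index is the label."""
    H, W, _ = img.shape
    th, tw = H // 3, W // 3
    tiles = [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(3) for c in range(3)]
    label = rng.randint(len(permutations))
    shuffled = [tiles[i] for i in permutations[label]]
    return np.stack(shuffled), label  # (9, th, tw, 3) tiles + permutation index

# Toy usage with a small random permutation set.
perms = np.array([np.random.permutation(9) for _ in range(100)])
tiles, label = make_jigsaw_sample(np.zeros((225, 225, 3), np.float32), perms)
```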

Page 46: Self-supervised Learning for Visual Recognition

46

Jigsaw++


(a) Self-Supervised Learning Pre-Training. Suppose that we are given a pretext task, a model and a dataset. Our first step in SSL is to train our model on the pretext task with the given dataset (see Fig. 2 (a)). Typically, the models of choice are convolutional neural networks, and one considers as feature the output of some intermediate layer (shown as a grey rectangle in Fig. 2 (a)).

(b) Clustering. Our next step is to compute feature vectors for all the images in our dataset. Then, we use the k-means algorithm with the Euclidean distance to cluster the features (see Fig. 2 (b)). Ideally, when performing this clustering on ImageNet images, we want the cluster centers to be aligned with object categories. In the experiments, we typically use 2,000 clusters.

(c) Extracting Pseudo-Labels. The cluster centers computed in the previous section can be considered as virtual categories. Indeed, we can assign feature vectors to the closest cluster center to determine a pseudo-label associated to the chosen cluster. This operation is illustrated in Fig. 2 (c). Notice that the dataset used in this operation might be different from that used in the clustering step or in the SSL pre-training.

(d) Cluster Classification. Finally, we train a simple classifier using the architecture of the target task so that, given an input image (from the dataset used to extract the pseudo-labels), it predicts the corresponding pseudo-label (see Fig. 2 (d)). This classifier learns a new representation in the target architecture that maps images that were originally close to each other in the pre-trained feature space to close points.

4. The Jigsaw++ Pretext Task

Recent work [7, 31] has shown that deeper architectures can help in SSL with PASCAL recognition tasks (e.g., ResNet). However, those methods use the same deep architecture for both SSL and fine-tuning. Hence, they are not comparable with previous methods that use a simpler AlexNet architecture in fine-tuning. We are interested in knowing how far one can improve the SSL pre-training of AlexNet for PASCAL tasks. Since in our framework the SSL task is not restricted to use the same architecture as in the final supervised task, we can increase the difficulty of the SSL task along with the capacity of the architecture and still use AlexNet at the fine-tuning stage.

To this aim, we extend the jigsaw [20] task and call it the jigsaw++ task. The original pretext task [20] is to find a reordering of tiles from a 3×3 grid of a square region cropped from an image. In jigsaw++, we replace a random number of tiles in the grid (up to 2) with (occluding) tiles from another random image (see Fig. 3). The number of tiles (0, 1 or 2 in our experiments) as well as their location are randomly selected. The occluding tiles make the task remarkably more complex. First, the model needs to detect the occluding tiles and second, it needs to solve the jigsaw problem by using only the remaining patches. To make sure we are not adding ambiguities to the task, we remove similar permutations so that the minimum Hamming distance between any two permutations is at least 3. In this way, there is a unique solution to the jigsaw task for any number of occlusions in our training setting. Our final training permutation set includes 701 permutations, in which the average and minimum Hamming distance is 0.86 and 3 respectively. In addition to the mean and std normalization of each patch independently, as it was done in the original paper, we train the network 70% of the time on the grayscale images. In this way, we prevent the network from using low-level statistics to detect occlusions and solve the jigsaw task.

Figure 3: The jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation of [20], where all tiles come from the same image. (d) a puzzle in the jigsaw++ task, where at most 2 tiles can come from a random image.

We train the jigsaw++ task on both VGG16 and AlexNet architectures. By having a larger capacity with VGG16, the network is better equipped to handle the increased complexity of the jigsaw++ task and is capable of extracting better representations from the data.

Following our pipeline in Fig. 2, we train our models …
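Steps (b) and (c) above, clustering the pre-trained features with k-means and reading off pseudo-labels, can be sketched with scikit-learn; the feature matrix below is a random stand-in and the cluster count is reduced so the sketch runs quickly (the text uses 2,000 clusters):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for features of the unlabeled images, taken from an intermediate
# layer of the pre-trained SSL network (e.g., VGG trained on jigsaw++).
feats = np.random.randn(5000, 256).astype(np.float32)

# (b) Clustering: k-means with Euclidean distance (2,000 clusters in the text;
# fewer here so the sketch runs quickly).
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(feats)

# (c) Pseudo-labels: each image gets the id of its nearest cluster center.
pseudo_labels = kmeans.predict(feats)

# (d) Cluster classification: train the target architecture (e.g., AlexNet)
# to predict pseudo_labels from the corresponding images, then fine-tune it
# on the labeled target task.
print(np.bincount(pseudo_labels)[:10])
```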


• Add distracting patches

• Increase number of permutations
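The excerpt above keeps only permutations whose pairwise Hamming distance is at least 3; a greedy rejection-sampling sketch of building such a set (illustrative, not the paper's generation script):

```python
import numpy as np

def hamming(p, q):
    """Number of tile positions where two orderings differ."""
    return int(np.sum(p != q))

def build_permutation_set(n_perms, min_dist=3, seed=0):
    """Greedily collect permutations of 9 tiles whose pairwise Hamming
    distance is at least min_dist."""
    rng = np.random.RandomState(seed)
    selected = [np.arange(9)]  # start from the identity ordering
    while len(selected) < n_perms:
        cand = rng.permutation(9)
        if all(hamming(cand, p) >= min_dist for p in selected):
            selected.append(cand)
    return np.stack(selected)

perms = build_permutation_set(n_perms=50)  # the paper's set has 701 entries
print(perms.shape)
```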

Page 47: Self-supervised Learning for Visual Recognition

47

Clusters on Jigsaw++

Page 48: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Fine-tuning on PASCAL VOC07

50

Results on transfer learning

Page 49: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Jigsaw++ (Ours) 72.5 56.5 42.6

Fine-tuning on PASCAL VOC07

51

Results on transfer learning

Page 50: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Jigsaw++ (Ours) 72.5 56.5 42.6

RotNet (ICLR’18) 72.9 54.4 39.1

Deep clustering (ECCV’18) 73.7 55.4 45.1

Fine-tuning on PASCAL VOC07

52

Results on transfer learning

Page 51: Self-supervised Learning for Visual Recognition

53

More complicated feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

More complicated pretext task

Page 52: Self-supervised Learning for Visual Recognition

54

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

HOG
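The same pseudo-label pipeline can be driven by hand-crafted HOG descriptors instead of a learned SSL network; a sketch with scikit-image and scikit-learn, with illustrative parameters:

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans

def hog_features(images):
    """images: iterable of (H, W) grayscale arrays -> (N, D) HOG descriptors."""
    return np.stack([hog(im, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for im in images])

# Stand-in images; in practice these are the unlabeled training images.
images = [np.random.rand(64, 64) for _ in range(500)]
feats = hog_features(images)

# Cluster the HOG descriptors and use the assignments as pseudo-labels,
# exactly as with the learned SSL features.
pseudo_labels = KMeans(n_clusters=50, n_init=4, random_state=0).fit_predict(feats)
print(pseudo_labels[:10])
```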

Page 53: Self-supervised Learning for Visual Recognition

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (ours) 67.7 52.4 36.6

Jigsaw++ (ours) 72.5 56.5 42.6

HOG (ours) 70.2 53.2 39.2

Fine-tuning on PASCAL VOC07

55

Results on transfer learning

Kaiming He, Ross Girshick, Piotr Dollár, "Rethinking ImageNet Pre-training", arXiv, Nov 2018.

Page 54: Self-supervised Learning for Visual Recognition

Visualization of conv1 filters

56

From scratch

CC on VGG-Jigsaw++

CC on HOG

Page 55: Self-supervised Learning for Visual Recognition

57

Thanks to

Mehdi Noroozi, Paolo Favaro, Ananth Kavalkazhani

Page 56: Self-supervised Learning for Visual Recognition

58

Thanks!