
REPAIR: Removing Representation Bias by Dataset Resampling
Yi Li and Nuno Vasconcelos

Statistical Visual Computing Lab, Department of Electrical and Computer Engineering, UC San Diego

Overview

• We present REPAIR, a resampling approach to minimizing representation bias of datasets

• Sensitivity analysis on video action recognition reveals some algorithms are more prone to biases than others

• Neural network models trained on de-biased datasets are shown to generalize better

Introduction

• Video action classification can often be solved with static frames, with no temporal information (Fig. 1)

• Representation bias [3]: “Preference” of a dataset towards different types of features
• High bias — feature representation informative for classification
• Problematic if a feature of high bias is not supposed to be sufficient (e.g. static features for video classification)
• Shortcuts (visual cues) might be exploited by discriminative models (e.g. background objects, environment)

• Neural nets may overfit to bias specific to one dataset, producing unfair decisions and failing to generalize

[Snapshots: crawling baby, feeding goats, snowmobiling, using computer, catching or throwing softball, welding]
Figure 1: Video snapshots of Kinetics [1], easily giving away the true action classes. No temporal reasoning needed here.

• Our goals:
  • Develop an algorithm (REPAIR) to reduce the static bias of datasets
  • Re-evaluate action recognition models in the absence of bias
  • Improve the generalization of networks using the REPAIRed training set

References

[1] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[2] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, et al. HMDB: a large video database for human motion recognition. In ICCV, pages 2556–2563, 2011.

[3] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In ECCV, pages 513–528, 2018.

[4] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

REPresentAtion bIas Removal (REPAIR)

Formulation: The bias of a dataset $\mathcal{D} = \{(X, Y)\}$ towards a representation $\phi$ is

$$B(\mathcal{D}, \phi) = 1 - \frac{R^*(\mathcal{D}, \phi)}{H(Y)} \qquad (1)$$

with linear classification risk and label entropy

$$R^*(\mathcal{D}, \phi) = \min_\theta \, \mathbb{E}_{X,Y}\!\left[-\log p_\theta(Y \mid \phi(X))\right] \approx \min_\theta \, -\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \log p_\theta(y_i \mid \phi(x_i))$$

$$H(Y) = \mathbb{E}_{X,Y}\!\left[-\log p(Y)\right] \approx -\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \log p_{y_i}$$

• Low risk $R^*(\mathcal{D}, \phi) \downarrow 0$ $\Rightarrow$ $\phi$ informative for solving dataset $\mathcal{D}$, hence higher bias
• High risk $R^*(\mathcal{D}, \phi) \uparrow H(Y)$ $\Rightarrow$ $\phi$ provides little information about label $Y$, hence lower bias
• Goal: Obtain a new dataset $\mathcal{D}'$ derived from $\mathcal{D}$ with reduced bias
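For intuition, a small worked example with illustrative numbers (not from the paper): with 10 balanced classes, $H(Y) = \log 10 \approx 2.30$ nats. If a linear classifier on $\phi$ attains empirical risk $R^*(\mathcal{D}, \phi) = 0.23$, then

$$B(\mathcal{D}, \phi) = 1 - \frac{0.23}{2.30} = 0.9,$$

i.e. $\phi$ alone nearly solves the dataset, so the dataset is strongly biased towards $\phi$.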

Dataset resampling: Weight each example $(x_i, y_i) \in \mathcal{D}$ by its probability $w_i$ of being selected.

• Minimize the reweighted bias $B(\mathcal{D}'_w, \phi) = 1 - \dfrac{R^*(\mathcal{D}'_w, \phi)}{H(Y'_w)}$ with

$$R^*(\mathcal{D}'_w, \phi) = \min_\theta \, -\sum_{i=1}^{|\mathcal{D}|} \frac{w_i}{\sum_i w_i} \log p_\theta(y_i \mid \phi(x_i)), \qquad H(Y'_w) = -\sum_{i=1}^{|\mathcal{D}|} \frac{w_i}{\sum_i w_i} \log p'_{y_i}, \qquad p'_y = \frac{\sum_{i: y_i = y} w_i}{\sum_i w_i}$$

• Leads to solving a minimax problem with adversarial training

$$\min_w \max_\theta \, V(w, \theta) = 1 - \frac{\sum_i w_i \log p_\theta(y_i \mid \phi(x_i))}{\sum_i w_i \log p'_{y_i}} \qquad (2)$$

• The classifier $\theta$ tries to classify examples in the feature space $\phi$
• The weights $w$ try to select a difficult set of examples (a sketch of this alternating optimization is given below)
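The minimax problem of Eq. (2) can be attacked with alternating (adversarial) gradient updates on θ and w. The following is a minimal PyTorch sketch of that idea, assuming the features φ(x_i) have been precomputed; the sigmoid parameterization of w, the SGD settings, and names such as `repair_weights`, `feats`, and `labels` are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def repair_weights(feats, labels, n_classes, n_steps=1000, lr=1e-2):
    """Estimate REPAIR example weights w_i by adversarial reweighting (Eq. 2).

    feats: float tensor (n, d) of precomputed features phi(x_i)
    labels: long tensor (n,) of class labels y_i
    """
    n, d = feats.shape
    classifier = torch.nn.Linear(d, n_classes)      # theta: linear classifier on phi(X)
    w_logits = torch.zeros(n, requires_grad=True)   # w_i = sigmoid(w_logits[i])
    opt_theta = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    opt_w = torch.optim.SGD([w_logits], lr=lr)

    for _ in range(n_steps):
        w = torch.sigmoid(w_logits)                  # selection probabilities in (0, 1)
        # reweighted classification risk R*(D'_w, phi)
        ce = F.cross_entropy(classifier(feats), labels, reduction="none")
        risk = (w * ce).sum() / w.sum()
        # reweighted label entropy H(Y'_w) from the resampled class priors p'_y
        class_w = torch.zeros(n_classes).scatter_add(0, labels, w)
        p_prime = class_w / w.sum()
        entropy = -(w * torch.log(p_prime[labels] + 1e-12)).sum() / w.sum()
        bias = 1.0 - risk / entropy                  # V(w, theta) of Eq. (2)

        # w descends on the bias; theta ascends (i.e. minimizes the risk)
        opt_theta.zero_grad()
        opt_w.zero_grad()
        bias.backward()
        for p in classifier.parameters():
            p.grad.neg_()                            # flip sign: gradient ascent for theta
        opt_theta.step()
        opt_w.step()

    return torch.sigmoid(w_logits).detach()          # final resampling weights w_i
```

The learned weights $w_i$ would then be fed to one of the resampling strategies described later to build the REPAIRed dataset.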

Colored MNIST

Experiment setup: Introduce color bias into the MNIST dataset by digit-dependent coloring (Fig. 2)

$$X^{\mathrm{color}}_{i,j,c} = S_c \cdot X_{i,j}, \qquad i, j \in \{0, \ldots, 27\},\ c \in \{0, 1, 2\} \qquad (3)$$

• Augment the original grayscale images $X_{i,j} \in [0, 1]$ with an RGB color $S = (S_0, S_1, S_2) = \phi(X^{\mathrm{color}})$
• The new dataset is biased if the color $S$ depends on the class label $Y$ (e.g. Gaussian with a different mean per class); a coloring sketch follows below
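A minimal sketch of the coloring in Eq. (3), assuming a grayscale image `X` in [0, 1] of shape (28, 28); the source only states that the color is Gaussian with a different mean per class, so the particular per-class mean colors and the deviation `sigma` below are illustrative assumptions.

```python
import numpy as np

def colorize(X, label, sigma=0.2, n_classes=10, rng=None):
    """Apply digit-dependent coloring (Eq. 3) to a grayscale image X in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    # class-dependent mean color: spread the classes over an RGB palette (illustrative)
    hues = np.linspace(0.0, 1.0, n_classes, endpoint=False)
    mean_color = np.array([hues[label], 1.0 - hues[label], 0.5])
    # sample S = (S0, S1, S2) around the class mean, clipped to valid intensities
    S = np.clip(rng.normal(mean_color, sigma), 0.0, 1.0)
    # Eq. (3): X_color[i, j, c] = S_c * X[i, j]
    return X[..., None] * S[None, None, :]
```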

Resampling the Digits

Resampling strategies: Select examples based on $w_i$ (a sketch of these rules follows below)
• threshold — keep $(x_i, y_i)$ with $w_i \geq t$
• rank — keep a fixed ratio of the $(x_i, y_i)$ with the greatest $w_i$
• cls-rank — keep the $(x_i, y_i)$ with the greatest $w_i$ within each class
• sample — keep $(x_i, y_i)$ with probability $w_i$
• uniform (baseline) — pick 50% of the $(x_i, y_i)$ at random
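A compact NumPy sketch of these selection rules, assuming the REPAIR weights `w` and labels `y` have already been computed as arrays; the defaults for `t` and `ratio` are illustrative.

```python
import numpy as np

def resample(w, y, strategy="rank", t=0.5, ratio=0.5, rng=None):
    """Return indices of the examples kept by the chosen resampling strategy."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(w)
    keep = np.zeros(n, dtype=bool)
    if strategy == "threshold":            # keep (x_i, y_i) with w_i >= t
        keep = w >= t
    elif strategy == "rank":               # keep the top `ratio` fraction by w_i
        k = max(1, int(ratio * n))
        keep[np.argsort(w)[-k:]] = True
    elif strategy == "cls-rank":           # top `ratio` fraction within each class
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            k = max(1, int(ratio * len(idx)))
            keep[idx[np.argsort(w[idx])[-k:]]] = True
    elif strategy == "sample":             # keep example i with probability w_i
        keep = rng.random(n) < w
    else:                                  # uniform baseline: random 50%
        keep = rng.random(n) < 0.5
    return np.flatnonzero(keep)
```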

Figure 2: Colored MNIST examples before & after resampling.

[Plots: bias and generalization accuracy (%) vs. color deviation (0.10–0.40), comparing the original set with the threshold, rank, cls-rank, sample, and uniform resampling strategies]
Figure 3: Bias and generalization accuracy after resampling.

Video Action Recognition

Static bias φ — ImageNet pre-trained CNN features (a feature-extraction sketch is given below)
• High bias ⟹ more static cues
• Minimize bias ⟹ emphasis on temporal modeling
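As a rough illustration of how the static representation φ could be obtained (the ResNet-50 backbone, the torchvision call, and mean-pooling over time are assumptions, not necessarily the paper's exact setup):

```python
import torch
import torchvision

# ImageNet pre-trained backbone with the classification head removed
# (the weights argument requires torchvision >= 0.13)
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()          # keep the 2048-d penultimate features
resnet.eval()

@torch.no_grad()
def static_feature(frames):
    """frames: (T, 3, 224, 224) ImageNet-normalized video frames."""
    feats = resnet(frames)               # (T, 2048) per-frame features
    return feats.mean(dim=0)             # (2048,) video-level static representation phi(X)
```

These pooled features would then play the role of φ(X) in Eqs. (1)–(2) for estimating and minimizing the static bias.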

REPAIRed datasets: UCF101 [4], HMDB51 [2], Kinetics [1]
• Videos that contain too many visual cues, e.g. Billiards, are discarded (Fig. 4)
• The remaining examples are difficult for a spatial CNN
• Some video CNN models rely heavily on static bias, others less so (Fig. 5)

[(a) UCF101: PlayingFlute 0.974, Billiards 0.085. (b) HMDB51: push 0.977, pullup 0.041. (c) Kinetics: flying kite 0.955, playing harp 0.056]
Figure 4: Videos with highest/lowest resampling weights $w_i$.

[Plots: accuracy (%) vs. dataset size (%) on UCF101 and Kinetics for the I3D, C2D, and TSN models, comparing REPAIR resampling with random sampling]
Figure 5: Algorithm performance on the resampled datasets.

Generalization: Training models on the REPAIRed dataset
• Overfitting to the bias may hurt generalization
• 3D CNN models trained on the REPAIRed Kinetics dataset generalize better to the same classes in HMDB51 (Table 1)

Remove ratio    0 (orig.)   0.25      0.5       0.75
Static bias     0.585       0.499     0.400     0.297
sword           12.43%      15.52%    16.99%    22.03%
hug             14.97%      16.26%    17.37%    17.11%
somersault      23.06%      23.97%    26.67%    29.26%
laugh           37.15%      56.09%    49.51%    50.42%
clap            53.47%      52.79%    52.31%    45.92%
shake hands     57.80%      60.31%    60.41%    61.40%
kiss            80.59%      80.87%    79.20%    78.96%
smoke           83.31%      80.87%    82.35%    83.29%
pushup          90.70%      88.12%    90.16%    87.26%
situp           90.39%      91.67%    88.09%    92.14%
ride bike       93.89%      94.94%    93.60%    91.02%
pullup          100.00%     100.00%   100.00%   100.00%
Average         61.48%      63.45%    63.06%    63.24%

Table 1: Cross-dataset generalization from REPAIRed Kinetics to HMDB51.