
Semi-Supervised Semantic Segmentation with Cross-Consistency Training

Yassine Ouali    Céline Hudelot    Myriam Tami

Université Paris-Saclay, CentraleSupélec, MICS, 91190, Gif-sur-Yvette, France.

{yassine.ouali,celine.hudelot,myriam.tami}@centralesupelec.fr

Abstract

In this paper, we present a novel cross-consistency based semi-supervised approach for semantic segmentation. Consistency training has proven to be a powerful semi-supervised learning framework for leveraging unlabeled data under the cluster assumption, in which the decision boundary should lie in low density regions. In this work, we first observe that for semantic segmentation, the low density regions are more apparent within the hidden representations than within the inputs. We thus propose cross-consistency training, where an invariance of the predictions is enforced over different perturbations applied to the outputs of the encoder. Concretely, a shared encoder and a main decoder are trained in a supervised manner using the available labeled examples. To leverage the unlabeled examples, we enforce a consistency between the main decoder predictions and those of the auxiliary decoders, taking as inputs different perturbed versions of the encoder's output, and consequently, improving the encoder's representations. The proposed method is simple and can easily be extended to use additional training signal, such as image-level labels or pixel-level labels across different domains. We perform an ablation study to tease apart the effectiveness of each component, and conduct extensive experiments to demonstrate that our method achieves state-of-the-art results on several datasets. Code is available at https://github.com/yassouali/CCT

1. Introduction

In recent years, with the wide adoption of deep super-

vised learning within the computer vision community, sig-

nificant strides were made across various visual tasks yield-

ing impressive results. However, training deep learning

models requires a large amount of labeled data whose acquisition is often costly and time consuming. In semantic

segmentation, given how expensive and laborious the ac-

quisition of pixel-level labels is, with a cost that is 15 times

and 60 times larger than that of region-level and image-level

labels respectively [33], the need for data efficient semantic

segmentation methods is even more evident.


Figure 1. The proposed Cross-Consistency training (CCT). For

the labeled examples, the encoder and the main decoder are trained

in a supervised manner. For the unlabeled examples, a consistency

between the main decoder’s predictions and those of the auxiliary

decoders is enforced, over different types of perturbations applied

to the inputs of the auxiliary decoders. Best viewed in color.

As a result, growing attention has been drawn to deep Semi-Supervised Learning (SSL) to take advantage of a large

amount of unlabeled data and limit the need for labeled ex-

amples. The current dominant SSL methods in deep learn-

ing are consistency training [43, 29, 50, 36], pseudo label-

ing [30], entropy minimization [17] and bootstrapping [42].

Some newly introduced techniques are based on generative

modeling [28, 49].

However, the recent progress in SSL was confined to

classification tasks, and its application in semantic segmen-

tation is still limited. Dominant approaches [22, 53, 52, 31]

focus on weakly-supervised learning, whose principle is to

generate pseudo pixel-level labels by leveraging the weak

labels, that can then be used, together with the limited

strongly labeled examples, to train a segmentation network

in a supervised manner. Generative Adversarial Networks

(GANs) were also adapted to the SSL setting [49, 23] by ex-

tending the generic GAN framework to pixel-level predic-

tions. The discriminator is then jointly trained with an ad-

versarial loss and a supervised loss over the labeled exam-

ples.


Nevertheless, these approaches suffer from some limi-

tations. Weakly-supervised approaches require weakly la-

beled examples along with pixel-level labels, hence, they

do not exploit the unlabeled data to extract additional train-

ing signal. Methods based on adversarial training exploit

the unlabeled data, but can be harder to train.

To address these limitations, we propose a simple consis-

tency based semi-supervised method for semantic segmen-

tation. The objective in consistency training is to enforce

an invariance of the model’s predictions over small pertur-

bations applied to the inputs. As a result, the learned model

will be robust to such small changes. The effectiveness

of consistency training depends heavily on the behavior of

the data distribution, i.e., the cluster assumption, where the

classes must be separated by low density regions. In se-

mantic segmentation, we do not observe the presence of low

density regions separating the classes within the inputs, but

rather within the encoder’s outputs. Based on this obser-

vation, we propose to enforce the consistency over differ-

ent forms of perturbations applied to the encoder’s output.

Specifically, we consider a shared encoder and a main de-

coder that are trained using the labeled examples. To lever-

age unlabeled data, we then consider multiple auxiliary de-

coders whose inputs are perturbed versions of the output of

the shared encoder. The consistency is imposed between

the main decoder’s predictions and that of the auxiliary de-

coders (see Fig. 1). This way, the shared encoder’s repre-

sentation is enhanced by using the additional training signal

extracted from the unlabeled data. The added auxiliary de-

coders have a negligible amount of parameters compared to

the encoder. Additionally, during inference, only the main

decoder is used, reducing the computation overhead both in

training and inference.

The proposed method is simple and efficient; it is also

flexible since it can easily be extended to use additional

weak labels and pixel-level labels across different domains

in a semi-supervised domain adaptation setting. With exten-

sive experiments, we demonstrate the effectiveness of our

approach on PASCAL VOC [12] in a semi-supervised set-

ting, and Cityscapes, CamVid [3] and SUN RGB-D [48] in a semi-supervised domain adaptation setting. We obtain competitive

results across different datasets and training settings.

Concretely, our contributions are four-fold:

• We propose a cross-consistency training (CCT)

method for semi-supervised semantic segmentation,

where the invariance of the predictions is enforced over

different perturbations injected into the encoder’s out-

put.

• We propose and conduct an exhaustive study of various

types of perturbations.

• We extend our approach to use weakly-labeled data,

and exploit pixel-level labels across different domains

to jointly train the segmentation network.

• We demonstrate the effectiveness of our approach with extensive and detailed experimental results, in-

cluding a comparison with the state-of-the-art, as well

as an in-depth analysis of our approach with a detailed

ablation study.

2. Related Work

Semi-Supervised Learning. Recently, many efforts

have been made to adapt classic SSL methods to deep learn-

ing, such as pseudo labeling [30], entropy minimization

[17] and graph based methods [34, 26] in order to overcome

this weakness. In this work, we focus mainly on consistency

training. We refer the reader to [5] for a detailed overview

of the field. Consistency training methods are based on the

assumption that, if a realistic form of perturbation was ap-

plied to the unlabeled examples, the predictions should not

change significantly. This favors models with decision boundaries that reside in low density regions, giving consistent predictions for similar inputs. For example, Π-Model [29]

enforces a consistency over two perturbed versions of the

inputs under different data augmentations and dropout. A

weighted moving average of either the previous predictions

(i.e., Temporal Ensembling [29]), or the model’s parame-

ters (i.e., Mean Teacher [50]), can be used to obtain more

stable predictions over the unlabeled examples. Instead of

relying on random perturbations, Virtual Adversarial Train-

ing (VAT) [36] approximates the perturbations that will alter

the model’s predictions the most.

Similarly, the proposed method enforces a consistency

of predictions between the main decoder and the auxiliary

decoders over different perturbations, that are applied to the

encoder’s outputs rather than the inputs. Our work is also

loosely related to Multi-View learning [60] and Cross-View

training [7], where each input to the auxiliary decoders can

be viewed as an alternate, but corrupted, representation of the

unlabeled examples.

Semi-Supervised Semantic Segmentation. A signif-

icant number of approaches use a limited number of pixel-

level labels together with a larger number of inexact an-

notations, e.g., region-level [47, 9] or image-level labels

[31, 61, 53, 32]. For image-level based weak-supervision,

primary localization maps are generated using class acti-

vation mapping (CAM) [61]. After refining the generated

maps, they can then be used to train a segmentation net-

work together with the available pixel-level labels in a SSL

setting.

Generative modeling can also be used for semi-

supervised semantic segmentation [49, 23] to take advan-

tage of the unlabeled examples. Under a GAN frame-

work, the discriminator’s predictions are extended over

pixel classes, and can then be jointly trained with a Cross-

Entropy loss over the labeled examples and an adversarial

loss over the whole dataset.


In comparison, the proposed method exploits the un-

labeled examples by enforcing a consistency over multi-

ple perturbations at the level of the hidden representations, enhancing the encoder's representation and the overall performance with a small additional cost in terms of computation and memory requirements.

Recently, CowMix [13], a concurrent method, was in-

troduced. CowMix, using MixUp [56], enforces a consis-

tency between the mixed outputs and the prediction over the

mixed inputs. In this context, CCT differs as follows: (1)

CowMix, as traditional consistency regularization methods,

applies the perturbations to the inputs, but uses MixUp as

a high-dimensional perturbation to overcome the absence

of the cluster assumption. (2) It requires multiple forward passes through the network for one training iteration. (3)

Adapting CowMix to other settings (e.g., over multiple do-

mains, using weak labels) may require significant changes.

CCT is efficient and can easily be extended to other settings.

Domain Adaptation. In many real world cases, the

existing discrepancy between the distribution of training

data and that of testing data will often hinder performance. Domain adaptation aims to rectify this mis-

match and tune the models for a better generalization at test

time [40]. Various generative and discriminative domain

adaptation methods have been proposed for classification

[16, 14, 15, 4] and semantic segmentation [21, 58, 44, 24]

tasks.

In this work, we show that enforcing a consistency across

different domains can push the model toward better gener-

alization, even in the extreme case of non-overlapping label

spaces.

3. Method

3.1. The cluster assumption in semantic segmentation

We start with our observation and analysis of the cluster

assumption in semantic segmentation, motivating the pro-

posal of our cross-consistency training approach. A simple

way to examine it is to estimate the local smoothness by

measuring the local variations between the value of each

pixel and its local neighbors. To this end, we compute the

average Euclidean distance between each spatial location and its 8 immediate neighbors, for both the inputs and the hid-

den representations (i.e., the ResNet’s [20] outputs of a

DeepLab v3 [6] trained on COCO [33]). For the inputs,

following [13], we compute the average distance of a patch

centered at a given spatial location and its neighbors to sim-

ulate a realistic receptive field. For the hidden representa-

tions, we first upsample the feature map to the input size,

and then compute the average distance between the neigh-

boring activations (2048-dimensional feature vectors). The

results are illustrated in Fig. 2. We observe that the clus-

ter assumption is violated at the input level, given that the

low density regions do not align with the class boundaries.

On the contrary, for the encoder’s outputs, the cluster as-

sumption is maintained where the class boundaries have

high average distance, thus corresponding to low density re-

gions. This observation motivates the following approach,

in which the perturbations are applied to the encoder’s out-

puts rather than the inputs.
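As a rough illustration of this measurement, the sketch below computes the per-position average Euclidean distance to the 8 immediate neighbors of a feature map; it assumes PyTorch tensors, and the helper name local_smoothness is hypothetical, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def local_smoothness(feat):
    """Average Euclidean distance between each spatial position of a
    C x H x W feature map and its 8 immediate neighbors. High values
    correspond to low density regions (hypothetical helper)."""
    c, h, w = feat.shape
    # Replicate-pad so that border positions also have 8 neighbors.
    padded = F.pad(feat.unsqueeze(0), (1, 1, 1, 1), mode="replicate").squeeze(0)
    dists = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[:, 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            dists.append((feat - neighbor).pow(2).sum(dim=0).sqrt())
    return torch.stack(dists).mean(dim=0)  # H x W smoothness map

# For the hidden representation level, the encoder output would first be
# upsampled to the input size, e.g. with F.interpolate(..., mode="bilinear").
```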

3.2. Cross-Consistency Training for semantic segmentation

3.2.1 Problem Definition

In SSL, we are provided with a small set of labeled train-

ing examples and a larger set of unlabeled training exam-

ples. Let $\mathcal{D}_l = \{(\mathbf{x}_1^l, y_1), \ldots, (\mathbf{x}_n^l, y_n)\}$ represent the $n$ labeled examples and $\mathcal{D}_u = \{\mathbf{x}_1^u, \ldots, \mathbf{x}_m^u\}$ represent the $m$ unlabeled examples, with $\mathbf{x}_i^u$ as the $i$-th unlabeled input image, and $\mathbf{x}_i^l$ as the $i$-th labeled input image with spatial dimensions $H \times W$ and its corresponding pixel-level label $y_i \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of classes.

As discussed in the introduction, the objective is to ex-

ploit the larger number of unlabeled examples (m ≫ n) to

train a segmentation network f , to perform well on the test

data drawn from the same distribution as the training data.

In this work, our architecture (see Fig. 3) is composed of

a shared encoder h and a main decoder g, which constitute

the segmentation network $f = g \circ h$. We also introduce a set of $K$ auxiliary decoders $g_a^k$, with $k \in [1, K]$. While the segmentation network $f$ is trained on the labeled set $\mathcal{D}_l$ in a traditional supervised manner, the auxiliary networks $g_a^k \circ h$ are trained on the unlabeled set $\mathcal{D}_u$ by enforcing a consistency of predictions between the main decoder and the auxiliary decoders. Each auxiliary decoder takes as input a perturbed version of the encoder's output, and the main decoder is fed the uncorrupted intermediate representation. This way, the

representation learning of the encoder h is further enhanced

using the unlabeled examples, and subsequently, that of the

segmentation network f .

3.2.2 Cross-Consistency Training

As stated above, to extract additional training signal from

the unlabeled set $\mathcal{D}_u$, we rely on enforcing a consistency between the outputs of the main decoder $g$ and those of the auxiliary decoders $g_a^k$. Formally, for a labeled training example $\mathbf{x}_i^l$ and its pixel-level label $y_i$, the segmentation network $f$ is trained using a Cross-Entropy (CE) based supervised loss $\mathcal{L}_s$:

$$\mathcal{L}_s = \frac{1}{|\mathcal{D}_l|} \sum_{\mathbf{x}_i^l, y_i \in \mathcal{D}_l} \mathbf{H}(y_i, f(\mathbf{x}_i^l)) \quad (1)$$

with $\mathbf{H}(\cdot, \cdot)$ as the CE. For an unlabeled example $\mathbf{x}_i^u$, an

intermediate representation of the input is computed using



Figure 2. The cluster assumption in semantic segmentation. (a)

Examples from PASCAL VOC 2012 train set. (b) Pixel-level la-

bels. (c) Input level. The average Euclidean distance between each patch of size 20 × 20 centered at a given spatial location extracted from the input images and its 8 neighboring patches. (d) Hidden representations level. The average Euclidean distance between a given 2048-dimensional activation at each spatial location and its

8 neighbors. Darkest regions indicate high average distance.

the shared encoder $z_i = h(\mathbf{x}_i^u)$*. Let us consider $R$ stochastic perturbation functions, denoted as $p_r$ with $r \in [1, R]$, where one perturbation function can be assigned to multiple auxiliary decoders. With various perturbation settings, we generate $K$ perturbed versions $\tilde{z}_i^k$ of the intermediate representation $z_i$, so that the $k$-th perturbed version is fed to the $k$-th auxiliary decoder. For consistency, we consider the perturbation function as part of the auxiliary decoder (i.e., $g_a^k$ can be seen as $g_a^k \circ p_r$). The training objective is then to minimize the unsupervised loss $\mathcal{L}_u$, which measures the discrepancy between the main decoder's output and that of the auxiliary decoders:

$$\mathcal{L}_u = \frac{1}{|\mathcal{D}_u|} \frac{1}{K} \sum_{\mathbf{x}_i^u \in \mathcal{D}_u} \sum_{k=1}^{K} d(g(z_i), g_a^k(z_i)) \quad (2)$$

with $d(\cdot, \cdot)$ as a distance measure between two output prob-

ability distributions (i.e., the outputs of a softmax func-

tion applied over the channel dimension). In this work, we

choose to use mean squared error (MSE) as a distance mea-

sure.
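As a minimal sketch (hypothetical function name, assuming per-pixel class logits from the main and an auxiliary decoder), the distance $d(\cdot, \cdot)$ can be written as:

```python
import torch.nn.functional as F

def consistency_distance(main_logits, aux_logits):
    """d(.,.) of Eq. (2): MSE between the two per-pixel class
    distributions, i.e. softmax applied over the channel dimension."""
    return F.mse_loss(F.softmax(aux_logits, dim=1),
                      F.softmax(main_logits, dim=1))
```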

The combined loss $\mathcal{L}$ for consistency-based SSL is then computed as:

$$\mathcal{L} = \mathcal{L}_s + \omega_u \mathcal{L}_u \quad (3)$$

* Throughout the paper, $z$ always refers to the output of the encoder corresponding to an unlabeled input image $\mathbf{x}^u$.


Figure 3. Illustration of our approach. For one training iteration,

we sample a labeled input image xl and its pixel-level label y to-

gether with an unlabeled image xu. We pass both images through

the encoder and main decoder, obtaining two main predictions $y^l$ and $y^u$. We compute the supervised loss using the pixel-level label $y$ and $y^l$. We apply various perturbations to $z$, the output of the encoder for $\mathbf{x}^u$, and generate auxiliary predictions $y_a^{(i)}$ using the perturbed versions $z^{(i)}$. The unsupervised loss is then computed

between the outputs of the auxiliary decoders and that of the main

decoder.

where $\omega_u$ is an unsupervised loss weighting function. Following [29], to avoid using the initial noisy predictions of the main decoder, $\omega_u$ ramps up starting from zero along a Gaussian curve up to a fixed weight $\lambda_u$. Concretely, at each training iteration, an equal number of examples are sampled from the labeled set $\mathcal{D}_l$ and the unlabeled set $\mathcal{D}_u$. The supervised loss is computed using the main decoder's output and the pixel-level labels. For the unlabeled examples, we compute the MSE between the prediction of each auxiliary decoder and that of the main decoder. The total loss is then computed and back-propagated to train the segmentation network $f$ and the auxiliary networks $g_a^k \circ h$. Note that the unsupervised loss $\mathcal{L}_u$ is not back-propagated through the main decoder $g$; only the labeled examples are used to train $g$.
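The following is a minimal sketch of one such training iteration in PyTorch. It is not the released implementation: the function names, the ramp-up length and the default weight $\lambda_u$ are placeholders, labels are assumed to be per-pixel class-index maps, and each auxiliary decoder is assumed to apply its own perturbation to $z$ internally, as in the composition $g_a^k \circ p_r$ described above.

```python
import numpy as np
import torch
import torch.nn.functional as F

def rampup_weight(step, rampup_steps, lambda_u):
    """Gaussian ramp-up of the unsupervised weight, following [29]."""
    if step >= rampup_steps:
        return lambda_u
    t = step / rampup_steps
    return lambda_u * float(np.exp(-5.0 * (1.0 - t) ** 2))

def training_step(encoder, main_dec, aux_decs, x_l, y_l, x_u, step,
                  rampup_steps=4000, lambda_u=30.0):
    # Supervised branch: labeled batch through encoder and main decoder.
    loss_s = F.cross_entropy(main_dec(encoder(x_l)), y_l)

    # Unsupervised branch: the main prediction is detached so that L_u is
    # not back-propagated through the main decoder g.
    z = encoder(x_u)
    with torch.no_grad():
        target = F.softmax(main_dec(z), dim=1)
    loss_u = sum(F.mse_loss(F.softmax(aux(z), dim=1), target)
                 for aux in aux_decs) / len(aux_decs)

    # Eq. (3): L = L_s + omega_u * L_u, with omega_u ramped up from zero.
    return loss_s + rampup_weight(step, rampup_steps, lambda_u) * loss_u
```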

3.2.3 Perturbation functions

An important factor in consistency training is the pertur-

bations to apply to the hidden representation, i.e., the en-

coder’s output z. We propose three types of perturbation

functions pr: feature based, prediction based and random.

Feature based perturbations. They consist of either injecting noise into, or dropping some of, the activations of the encoder's output feature map $z$ (a code sketch of these and of the random dropout perturbation is given at the end of this subsection).

• F-Noise: we uniformly sample a noise tensor $N \sim \mathcal{U}(-0.3, 0.3)$ of the same size as $z$. After adjusting its amplitude by multiplying it with $z$, the noise is then injected into the encoder's output $z$ to get $\tilde{z} = (z \odot N) + z$. This way, the injected noise is proportional to each activation.


• F-Drop: we first uniformly sample a threshold $\gamma \sim \mathcal{U}(0.6, 0.9)$. After summing over the channel dimension and normalizing the feature map $z$ to get $z'$, we generate a mask $M_{drop} = \{z' < \gamma\}_1$†, which is then used to obtain the perturbed version $\tilde{z} = z \odot M_{drop}$. This way, we mask 10% to 40% of the most active regions in the feature map.

Prediction based perturbations. They consist of

adding perturbations based on the main decoder’s predic-

tion $y = g(z)$ or those of the auxiliary decoders. We consider masking based perturbations (Con-Msk, Obj-Msk and G-Cutout) in addition to adversarial perturbations

(I-VAT).

• Guided Masking: Given the importance of context re-

lationships for complex scene understanding [37], the

network might be too reliant on these relationships. To limit this reliance, we create two perturbed versions of $z$ by masking the detected objects (Obj-Msk) and the context (Con-Msk). Using $y$, we generate an object mask $M_{obj}$ to mask the detected foreground objects and a context mask $M_{con} = 1 - M_{obj}$, which are then applied to $z$.

• Guided Cutout (G-Cutout): in order to reduce the re-

liance on specific parts of the objects, and inspired by

Cutout [11] that randomly masks some parts of the

input image, we first find the possible spatial extent

(i.e., bounding box) of each detected object using y.

We then zero-out a random crop within each object’s

bounding box from the corresponding feature map z.

• Intermediate VAT (I-VAT): to further push the out-

put distribution to be isotropically smooth around each

data point, we investigate using VAT [36] as a pertur-

bation function to be applied to z instead of the unla-

beled inputs. For a given auxiliary decoder, we find

the adversarial perturbation $r_{adv}$ that will alter its prediction the most. The noise is then injected into $z$ to obtain the perturbed version $\tilde{z} = r_{adv} + z$.

Random perturbations. (DropOut) Spatial dropout

[51] is also applied to z as a random perturbation.
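As an illustration, a minimal sketch of the feature-based and random perturbations follows; the exact normalization used in F-Drop is an assumption, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def f_noise(z):
    """F-Noise: uniform noise N ~ U(-0.3, 0.3) of the same size as z,
    scaled by z so the perturbation is proportional to each activation."""
    noise = torch.empty_like(z).uniform_(-0.3, 0.3)
    return z + z * noise

def f_drop(z):
    """F-Drop: drop the 10%-40% most active regions. The channel-summed
    map is normalized by its maximum (one plausible normalization) and
    thresholded at gamma ~ U(0.6, 0.9)."""
    gamma = float(torch.empty(1).uniform_(0.6, 0.9))
    attention = z.sum(dim=1, keepdim=True)                       # N x 1 x H x W
    attention = attention / (attention.amax(dim=(2, 3), keepdim=True) + 1e-6)
    return z * (attention < gamma).float()

def random_dropout(z, p=0.5):
    """Random perturbation: spatial (channel-wise) dropout applied to z."""
    return F.dropout2d(z, p=p, training=True)
```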

3.2.4 Practical considerations

At each training iteration, we sample an equal number of la-

beled and unlabeled samples. As a consequence, we iterate

on the set Dl more times than on its unlabeled counterpart

Du, thus risking an overfitting of the labeled set Dl.

Avoiding Overfitting. Motivated by [41], who observed improved results by sampling only 6% of the hardest pixels, and [54], who showed an improvement when gradually releasing the supervised training signal in an SSL setting, we propose an annealed version of the bootstrapped-CE (ab-CE) in [41]. With an output $f(\mathbf{x}_i^l) \in \mathbb{R}^{C \times H \times W}$ in the form of a probability distribution over the pixels, we only compute the supervised loss over the pixels with a probability less than a threshold $\eta$:

$$\mathcal{L}_s = \frac{1}{|\mathcal{D}_l|} \sum_{\mathbf{x}_i^l, y_i \in \mathcal{D}_l} \{f(\mathbf{x}_i^l) < \eta\}_1 \, \mathbf{H}(y_i, f(\mathbf{x}_i^l)) \quad (4)$$

† $\{\text{condition}\}_1$ is a boolean function outputting 1 if the condition is true, 0 otherwise.

To release the supervised training signal, the threshold parameter $\eta$ is gradually increased from $\frac{1}{C}$ to 0.9 during the beginning of training, with $C$ as the number of output classes.
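A possible reading of Eq. (4), where $f(\mathbf{x}_i^l)$ is taken as the predicted probability of the ground-truth class, is sketched below; the annealing length and the handling of ignored pixels are left out, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def annealed_bootstrapped_ce(logits, target, step, anneal_steps,
                             num_classes, eta_max=0.9):
    """ab-CE (Eq. 4): supervised CE computed only over pixels whose
    predicted probability for the ground-truth class is below eta, with
    eta annealed from 1/C to eta_max during the beginning of training."""
    t = min(step / anneal_steps, 1.0)
    eta = 1.0 / num_classes + t * (eta_max - 1.0 / num_classes)

    probs = F.softmax(logits, dim=1)                            # N x C x H x W
    gt_prob = probs.gather(1, target.unsqueeze(1)).squeeze(1)   # N x H x W
    mask = (gt_prob < eta).float()

    pixel_ce = F.cross_entropy(logits, target, reduction="none")
    return (mask * pixel_ce).sum() / mask.sum().clamp(min=1.0)
```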

3.3. Exploiting weak labels

In some cases, we might be provided with additional

training data that is less expensive to acquire compared to

pixel-level labels, e.g., image-level labels. Formally, instead

of an unlabeled set $\mathcal{D}_u$, we are provided with a weakly labeled set $\mathcal{D}_w = \{(\mathbf{x}_1^w, y_1^w), \ldots, (\mathbf{x}_m^w, y_m^w)\}$ alongside a pixel-level labeled set $\mathcal{D}_l$, where $y_i^w$ is the $i$-th image-level label corresponding to the $i$-th weakly labeled input image $\mathbf{x}_i^w$. The objective is to extract additional information from

the weakly labeled set $\mathcal{D}_w$ to further enhance the representa-

tions of the encoder h. To this end, we add a classification

branch gc consisting of a global average pooling layer fol-

lowed by a classification layer, and pretrain the encoder for

a classification task using binary CE loss.

Following previous works [1, 31, 22], the pretrained en-

coder and the added classification branch can then be ex-

ploited to generate pseudo pixel-level labels $y^p$. We start by generating the CAMs $M$ as in [61]. Using $M \in \mathbb{R}^{C \times H \times W}$, we can then generate pseudo labels $y^p$ with a background threshold $\theta_{bg}$ and a foreground threshold $\theta_{fg}$. The pixels with attention scores less than $\theta_{bg}$ (e.g., 0.05) are considered as background. Pixels with an attention score larger than $\theta_{fg}$ (e.g., 0.30) are assigned the class with the maximal attention score, and the rest of the pixels are ignored. After generating $y^p$, we conduct a final refinement

step using dense CRF [27].
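A minimal sketch of this thresholding is given below, assuming CAMs normalized to [0, 1] over the foreground classes and background mapped to class 0; the CRF refinement step is omitted and the function name is hypothetical.

```python
import torch

def cams_to_pseudo_labels(cams, theta_bg=0.05, theta_fg=0.30, ignore_index=255):
    """Turn class activation maps M (C_fg x H x W) into pseudo pixel-level
    labels: scores below theta_bg -> background (0), scores above theta_fg
    -> the class with the maximal attention score, the rest -> ignored."""
    max_scores, argmax_cls = cams.max(dim=0)                 # H x W
    pseudo = torch.full_like(argmax_cls, ignore_index)
    pseudo[max_scores < theta_bg] = 0                        # background
    fg = max_scores > theta_fg
    pseudo[fg] = argmax_cls[fg] + 1                          # offset for background
    return pseudo
```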

In addition to considering $\mathcal{D}_w$ as an unlabeled set and imposing a consistency over its examples, the pseudo-labels are used to train the auxiliary networks $g_a^k \circ h$ using a weakly supervised loss $\mathcal{L}_w$. In this case, the loss in Eq. (3) becomes:

$$\mathcal{L} = \mathcal{L}_s + \omega_u \mathcal{L}_u + \omega_w \mathcal{L}_w \quad (5)$$

with

$$\mathcal{L}_w = \frac{1}{|\mathcal{D}_w|} \frac{1}{K} \sum_{\mathbf{x}_i^w \in \mathcal{D}_w} \sum_{k=1}^{K} \mathbf{H}(y^p, g_a^k(z_i)) \quad (6)$$
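A short sketch of $\mathcal{L}_w$ and the combined objective of Eq. (5), under the same assumptions as the earlier training-step sketch (pseudo labels as per-pixel class-index maps, hypothetical names):

```python
import torch.nn.functional as F

def weak_loss(aux_preds, pseudo_label, ignore_index=255):
    """L_w (Eq. 6): CE between each auxiliary decoder's prediction and the
    CAM-derived pseudo label, averaged over the K auxiliary decoders."""
    return sum(F.cross_entropy(p, pseudo_label, ignore_index=ignore_index)
               for p in aux_preds) / len(aux_preds)

# Eq. (5): loss = loss_s + omega_u * loss_u + omega_w * weak_loss(aux_preds, y_p)
```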



Figure 4. CCT on multiple domains. On top of a shared encoder,

we add domain specific main decoder and K auxiliary decoders.

During training, we alternate between the two domains, sampling

labeled and unlabeled examples and training the corresponding de-

coders and the shared encoder at each iteration.

3.4. Cross-Consistency Training on Multiple Domains

In this section, we extend the proposed framework to a semi-supervised domain adaptation setting. We consider the case of two datasets $\{\mathcal{D}^{(1)}, \mathcal{D}^{(2)}\}$ with partially or fully non-overlapping label spaces, each containing a set of labeled and unlabeled examples $\mathcal{D}^{(i)} = \{\mathcal{D}_l^{(i)}, \mathcal{D}_u^{(i)}\}$. The objective is to simultaneously train a segmentation network to perform well on the test data of both datasets, which are drawn from different distributions.

Our assumption is that enforcing a consistency over both

unlabeled sets $\mathcal{D}_u^{(1)}$ and $\mathcal{D}_u^{(2)}$ might impose an invariance of the encoder's representations across the two domains. To this end, on top of the shared encoder $h$, we add a domain-specific main decoder $g^{(i)}$ and auxiliary decoders $g_a^{k(i)}$. Specif-

ically, as illustrated in Fig. 4, we add two main decoders

and 2K auxiliary decoders on top of the encoder h. During

training, we alternate between the two datasets, at each iter-

ation, sampling an equal number of labeled and unlabeled

examples from each one, computing the loss in Eq. (3) and

training the shared encoder and the corresponding main and

auxiliary decoders.
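A minimal sketch of this alternation, reusing the hypothetical training_step() from Section 3.2.2 and assuming each domain's loader yields (labeled image, label, unlabeled image) batches:

```python
def multi_domain_training(encoder, main_decs, aux_decs, loaders, optimizer,
                          rampup_steps=4000, lambda_u=30.0):
    """Alternate between the two domains: at every iteration, update the
    shared encoder together with that domain's main and auxiliary decoders
    using the loss of Eq. (3)."""
    step = 0
    for batches in zip(*loaders):            # one (x_l, y_l, x_u) per domain
        for i, (x_l, y_l, x_u) in enumerate(batches):
            loss = training_step(encoder, main_decs[i], aux_decs[i],
                                 x_l, y_l, x_u, step, rampup_steps, lambda_u)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
```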

4. Experiments

To evaluate the proposed method and investigate its ef-

fectiveness in different settings, we carry out detailed ex-

periments. In Section 4.4, we present an extensive abla-

tion study to highlight the contribution of each component

within the proposed framework, and compare it to state-

of-the-art methods in a semi-supervised setting. Addition-

ally, in Section 4.5 we apply the proposed method in a

semi-supervised domain adaptation setting and show per-

formance above baseline methods.

4.1. Network Architecture

Encoder. For the following experiments, the encoder

is based on a ResNet-50 [20] pretrained on ImageNet [10]

provided by [55] and a PSP module [59]. Following previ-

ous works [59, 22, 1], the last two strided convolutions of

ResNet are replaced with dilated convolutions.

Decoders. For the decoders, taking the efficiency and

the number of parameters into consideration, we choose to

only use 1 × 1 convolutions. After an initial 1 × 1 con-

volution to adapt the depth to the number of classes C,

we apply a series of three sub-pixel convolutions [45] with

ReLU non-linearities to upsample the outputs to the original input size.
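A sketch of such a decoder is given below; the exact placement of the non-linearities and the assumption of an overall output stride of 8 for the encoder are ours, not taken from the released code.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """1x1 convolution mapping the encoder depth to C classes, followed by
    three sub-pixel (PixelShuffle) x2 upsampling stages, restoring the
    input resolution for an encoder that downsamples by 8."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        layers = [nn.Conv2d(in_channels, num_classes, kernel_size=1)]
        for i in range(3):
            layers += [nn.Conv2d(num_classes, num_classes * 4, kernel_size=1),
                       nn.PixelShuffle(2)]
            if i < 2:                        # no ReLU on the final logits
                layers.append(nn.ReLU(inplace=True))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
```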

4.2. Datasets and Evaluation Metrics

Datasets. In a semi-supervised setting, we evaluate the

proposed method on PASCAL VOC [12], consisting of 21

classes (with the background included) and three splits,

training, validation and testing, with 1464, 1449 and

1456 images respectively. Following the common practice

[22, 59], we augment the training set with additional images

from [19]. Note that the pixel-level labels are only extracted

from the original training set.

For semi-supervised domain adaptation, for partially over-

lapping label spaces, we train on both Cityscapes [8] and

CamVid [3]. Cityscapes is a finely annotated autonomous

driving dataset with 19 classes. We are provided with three

splits, training, validation and testing with 2975, 500 and

1525 images respectively. CamVid contains 367 training,

101 validation and 233 testing images. Although originally

the dataset is labeled with 38 classes, we use the 11 classes

version [2]. For experiments over non-overlapping label

spaces, we train on Cityscapes and SUN RGB-D [48]. SUN

RGB-D is an indoor segmentation dataset with 38 classes

containing two splits, training and validation, with 5285 and

5050 images respectively. Similar to [24], we train on the

13 classes version [18].

Evaluation Metrics. We report the results using mIoU

(i.e., mean of class-wise intersection over union) for all the

datasets.

4.3. Implementation Details

Training Settings. The implementation is based on the

PyTorch 1.1 [39] framework. For optimization, we train

for 50 epochs using SGD with a learning rate of 0.01 and

a momentum of 0.9. During training, the learning rate is

annealed following the poly learning rate policy, where at

each iteration, the base learning rate is multiplied by $\left(1 - \frac{iter}{max\_iter}\right)^{power}$ with $power = 0.9$.
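For reference, a minimal sketch of this schedule (the helper name is hypothetical):

```python
import torch

def make_poly_scheduler(optimizer, max_iter, power=0.9):
    """Poly policy: at each iteration, multiply the base learning rate by
    (1 - iter / max_iter) ** power."""
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** power)

# Usage with the settings above:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# scheduler = make_poly_scheduler(optimizer, max_iter=total_iters)
# ... call scheduler.step() after every iteration.
```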

For PASCAL VOC, we take crops of size 321 × 321 and apply random rescaling in the range of [0.5, 2.0] and random horizontal flips. For Cityscapes, CamVid and SUN RGB-D, following [24, 23], we resize the input images to 512 × 1024, 360 × 480 and 480 × 640 respectively, without any data augmentation.

Reproducibility. All the experiments are conducted on NVIDIA V100 GPUs. The implementation is available at https://github.com/yassouali/CCT


Figure 5. Ablation Studies on CamVid with 20, 50 and 100 labeled images. With different types of perturbations and a variable number

of auxiliary decoders K, we compare the individual and the combined effectiveness of the perturbations to the baseline in which the model

is trained only on the labeled examples. CCT full refers to using all of the 7 perturbations, i.e., the number of auxiliary decoders is K × 7.

Figure 6. Ablation study on PASCAL VOC. Ablation study re-

sults with 1000 labeled examples using different perturbations and

various numbers of auxiliary decoders K.

Inference Settings. For PASCAL VOC, Cityscapes and

SUN RGB-D, we report the results obtained on the valida-

tion set, and on the test set of CamVid dataset.

4.4. Semi-Supervised Setting

4.4.1 Ablation Studies

The proposed method consists of several types of pertur-

bations and a variable number of auxiliary decoders. We

thus start by studying the effect of the perturbation functions

with different numbers of auxiliary decoders, in order to

provide additional insight into their individual performance

and their combined effectiveness. Specifically, we measure

the effect of different numbers of auxiliary decoders K (i.e.,

K = 2, 4, 6 and 8) of a given perturbation type. We re-

fer to this setting of our method as “CCT {perturbation

type}”, with seven possible perturbations. We also mea-

sure the combined effect of all perturbations resulting in

K × 7 auxiliary decoders in total, and refer to it as “CCT

full”. Additionally, “CCT full+ab-CE” indicates the usage

of the annealed-bootstrapped CE as a supervised loss func-

tion. We compare them to the baseline, in which the model

is trained only using the labeled examples.

CamVid. We carried out the ablation on CamVid with

20, 50 and 100 labels; the results are shown in Fig. 5. We

find that each perturbation outperforms the baseline, with

Method              Pixel-level Labeled Examples   Image-level Labeled Examples   Val
WSSL [38]           1.5k                            9k                             64.6
GAIN [32]           1.5k                            9k                             60.5
MDC [53]            1.5k                            9k                             65.7
DSRG [22]           1.5k                            9k                             64.3
Souly et al. [49]   1.5k                            9k                             65.8
FickleNet [31]      1.5k                            9k                             65.8
Souly et al. [49]   1.5k                            -                              64.1
Hung et al. [23]    1.5k                            -                              68.4
CCT                 1k                              -                              64.0
CCT                 1.5k                            -                              69.4
CCT                 1.5k                            9k                             73.2

Table 1. Comparison with the state-of-the-art. CCT performance on PASCAL VOC compared to other semi-supervised approaches.

the most dramatic differences in the 20-label setting with

up to 21 points. We also surprisingly observe an insignif-

icant overall performance gap among different perturba-

tions, confirming the effectiveness of enforcing the consis-

tency over the hidden representations for semantic segmen-

tation, and highlighting the versatility of CCT and its suc-

cess with numerous perturbations. Increasing K results in

a modest improvement overall, with the smallest change for

Con-Msk and Obj-Msk due to their lack of stochasticity.

Interestingly, we also observe a slight improvement when

combining all of the perturbations, indicating that the en-

coder is able to generate representations that are consistent

over many perturbations, and subsequently, improving the

overall performance. Additionally, gradually releasing the

training signal using ab-CE helps increase the performance

by up to 8%, which confirms that overfitting of the labeled

examples can cause a significant drop in performance.

PASCAL VOC. In order to investigate the success of

CCT on larger datasets, we conduct additional ablation ex-

periments on PASCAL VOC using 1000 labeled examples.

The results are summarized in Fig. 6. We see similar re-


sults, where the proposed method makes further improve-

ment compared to the baseline with different perturbations,

from 10 to 15 points. The combined perturbations yield a

small increase in the performance, with the biggest differ-

ence with K = 6. Furthermore, similar to CamVid, when

using the ab-CE loss, we see a significant gain with up to 7points compared to CCT full.

Based on the conducted ablation studies, for the rest

of the experiments, we use the setting of “CCT full” with

K = 2 for Con-Msk and Obj-Msk due to their lack of

stochasticity, K = 2 for I-VAT given its high computa-

tional cost, and K = 6 for the rest of the perturbations, and

refer to it as “CCT”.

4.4.2 Comparison to Previous Work

To further explore the effectiveness of our framework, we

quantitatively compare it with previous semi-supervised se-

mantic segmentation methods on PASCAL VOC. Table 1

compares CCT with other semi-supervised approaches. Our

approach outperforms previous works relying on the same

level of supervision and even methods which exploit image-

level labels. We also observe an increase of 3.8 points when

using additional image-level labels, affirming the flexibility

of CCT, and the possibility of using it with different types

of labels without any learning conflicts.

4.5. Semi-Supervised Domain Adaptation Setting

In real world applications, we are often provided with

pixel-level labels collected from various sources, thus dis-

tinct data distributions. To examine the effectiveness of

CCT when applied to multiple domains with a variable de-

gree of labels overlap, we train our model simultaneously

on two datasets, Cityscapes (CS) + CamVid (CVD) for par-

tially overlapping labels, and Cityscapes + SUN RGB-D

(SUN) for the disjoint case.

Method                 n=50                    n=100
                       CS    CVD   Avg.        CS    CVD   Avg.
Kalluri et al. [24]    34.0  53.2  43.6        41.0  54.6  47.8
Baseline               31.2  40.0  35.6        37.3  34.4  35.9
CCT                    35.0  53.7  44.4        40.1  55.7  47.9

Table 2. CCT applied to CS+CVD. CCT performance when simultaneously trained on two datasets with overlapping label spaces, Cityscapes (CS) and CamVid (CVD).

Cityscapes + CamVid. The results for CCT on

Cityscapes and CamVid datasets with 50 and 100 labeled

examples are given in Table 2. Similar to the SSL set-

ting, CCT significantly outperforms the baseline, in which the model is iteratively trained using only the labeled examples, by up to 12 points for n = 100, and we even see a modest increase compared to previous work. This confirms our hy-

pothesis that enforcing a consistency over different datasets

does indeed push the encoder to produce invariant represen-

tation across different domains, and consequently, increases

the performance over the baseline while delivering similar

results on each domain individually.

Method                 Labeled Examples    CS     SUN    Avg.
SceneNet [35]          Full (5.3k)         -      49.8   -
Kalluri et al. [24]    1.5k                58.0   31.5   44.8
Baseline               1.5k                54.3   38.1   46.2
CCT                    1.5k                58.8   45.5   52.1

Table 3. CCT applied to CS+SUN. CCT performance when trained on both Cityscapes (CS) and SUN RGB-D (SUN), for the case of non-overlapping label spaces.

Cityscapes + SUN RGB-D. For cross domain experi-

ments, where the two domains have distinct label spaces,

we train on both Cityscapes and SUN RGB-D to demon-

strate the capability of CCT to extract useful visual rela-

tionships and perform knowledge transfer between dissim-

ilar domains, even in completely different settings. The re-

sults are shown in Table 3. Interestingly, despite the distribution mismatch between the datasets and the high number of labeled examples (n = 1500), CCT still provides a meaningful boost over the baseline, with a 5.9-point difference, and 7.3 points compared to previous work. This shows that, by enforcing a consistency of predictions on the unla-

beled sets of the two datasets over different perturbations,

we can extract additional training signal and enhance the

representation learning of the encoder, even in the extreme

case with non-overlapping label spaces, without any perfor-

mance drop when an invariance of representations across

both datasets is enforced at the level of the encoder's outputs.

5. Conclusion

In this work, we present cross-consistency training

(CCT), a simple, efficient and flexible method for consistency-based semi-supervised semantic segmentation, yielding state-of-the-art results. For future work, a possible di-

rection is exploring the usage of other perturbations to be

applied at different levels within the segmentation network.

It would also be interesting to adapt and examine the effec-

tiveness of CCT in other visual tasks and learning settings,

such as unsupervised domain adaptation.

Acknowledgements. This work was supported by the Randstad corporate research chair. We would also like to thank the Saclay-IA platform of Université Paris-Saclay and the Mésocentre computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay for providing the computational resources.


References

[1] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic

affinity with image-level supervision for weakly supervised

semantic segmentation. In Proceedings of the IEEE Con-

ference on Computer Vision and Pattern Recognition, pages

4981–4990, 2018.

[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla.

Segnet: A deep convolutional encoder-decoder architecture

for image segmentation. IEEE transactions on pattern anal-

ysis and machine intelligence, 39(12):2481–2495, 2017.

[3] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla.

Semantic object classes in video: A high-definition ground

truth database. Pattern Recognition Letters, 30(2):88–97,

2009.

[4] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin

Wang. Partial adversarial domain adaptation. In Proceedings

of the European Conference on Computer Vision (ECCV),

pages 135–150, 2018.

[5] Olivier Chapelle, Bernhard Scholkopf, and Alexander

Zien. Semi-supervised learning (chapelle, o. et al., eds.;

2006)[book reviews]. IEEE Transactions on Neural Net-

works, 20(3):542–542, 2009.

[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and

Hartwig Adam. Rethinking atrous convolution for seman-

tic image segmentation. arXiv preprint arXiv:1706.05587,

2017.

[7] Kevin Clark, Minh-Thang Luong, Christopher D Manning,

and Quoc V Le. Semi-supervised sequence modeling

with cross-view training. arXiv preprint arXiv:1809.08370,

2018.

[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo

Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe

Franke, Stefan Roth, and Bernt Schiele. The cityscapes

dataset for semantic urban scene understanding. In Proceed-

ings of the IEEE conference on computer vision and pattern

recognition, pages 3213–3223, 2016.

[9] Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting

bounding boxes to supervise convolutional networks for se-

mantic segmentation. 2015 IEEE International Conference

on Computer Vision (ICCV), Dec 2015.

[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,

and Li Fei-Fei. Imagenet: A large-scale hierarchical image

database. In 2009 IEEE conference on computer vision and

pattern recognition, pages 248–255. Ieee, 2009.

[11] Terrance DeVries and Graham W Taylor. Improved regular-

ization of convolutional neural networks with cutout. arXiv

preprint arXiv:1708.04552, 2017.

[12] Mark Everingham, Luc Van Gool, Christopher KI Williams,

John Winn, and Andrew Zisserman. The pascal visual object

classes (voc) challenge. International journal of computer

vision, 88(2):303–338, 2010.

[13] Geoff French, Timo Aila, Samuli Laine, Michal Mackiewicz,

and Graham Finlayson. Consistency regularization and

cutmix for semi-supervised semantic segmentation. arXiv

preprint arXiv:1906.01916, 2019.

[14] Yaroslav Ganin and Victor Lempitsky. Unsupervised

domain adaptation by backpropagation. arXiv preprint

arXiv:1409.7495, 2014.

[15] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pas-

cal Germain, Hugo Larochelle, Francois Laviolette, Mario

Marchand, and Victor Lempitsky. Domain-adversarial train-

ing of neural networks. The Journal of Machine Learning

Research, 17(1):2096–2030, 2016.

[16] Bo Geng, Dacheng Tao, and Chao Xu. Daml: Domain adap-

tation metric learning. IEEE Transactions on Image Process-

ing, 20(10):2980–2989, 2011.

[17] Yves Grandvalet and Yoshua Bengio. Semi-supervised

learning by entropy minimization. In Advances in neural

information processing systems, pages 529–536, 2005.

[18] Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Si-

mon Stent, and Roberto Cipolla. Understanding real world

indoor scenes with synthetic data. In Proceedings of the

IEEE Conference on Computer Vision and Pattern Recog-

nition, pages 4077–4085, 2016.

[19] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev,

Subhransu Maji, and Jitendra Malik. Semantic contours from

inverse detectors. In 2011 International Conference on Com-

puter Vision, pages 991–998. IEEE, 2011.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.

Deep residual learning for image recognition. In Proceed-

ings of the IEEE conference on computer vision and pattern

recognition, pages 770–778, 2016.

[21] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell.

Fcns in the wild: Pixel-level adversarial and constraint-based

adaptation. arXiv preprint arXiv:1612.02649, 2016.

[22] Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, and

Jingdong Wang. Weakly-supervised semantic segmentation

network with deep seeded region growing. In Proceedings

of the IEEE Conference on Computer Vision and Pattern

Recognition, pages 7014–7023, 2018.

[23] Wei-Chih Hung, Yi-Hsuan Tsai, Yan-Ting Liou, Yen-Yu

Lin, and Ming-Hsuan Yang. Adversarial learning for

semi-supervised semantic segmentation. arXiv preprint

arXiv:1802.07934, 2018.

[24] Tarun Kalluri, Girish Varma, Manmohan Chandraker, and

CV Jawahar. Universal semi-supervised semantic segmenta-

tion. arXiv preprint arXiv:1811.10323, 2018.

[25] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In

So Kweon. Two-phase learning for weakly supervised ob-

ject localization. In Proceedings of the IEEE International

Conference on Computer Vision, pages 3534–3543, 2017.

[26] Thomas N Kipf and Max Welling. Semi-supervised classi-

fication with graph convolutional networks. arXiv preprint

arXiv:1609.02907, 2016.

[27] Philipp Krahenbuhl and Vladlen Koltun. Efficient inference

in fully connected crfs with gaussian edge potentials. In Ad-

vances in neural information processing systems, pages 109–

117, 2011.

[28] Abhishek Kumar, Prasanna Sattigeri, and Tom Fletcher.

Semi-supervised learning with gans: Manifold invariance

with improved inference. In Advances in Neural Informa-

tion Processing Systems, pages 5534–5544, 2017.


[29] Samuli Laine and Timo Aila. Temporal ensembling for semi-

supervised learning, 2016.

[30] Dong-Hyun Lee. Pseudo-label: The simple and efficient

semi-supervised learning method for deep neural networks.

In Workshop on Challenges in Representation Learning,

ICML, volume 3, page 2, 2013.

[31] Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, and

Sungroh Yoon. Ficklenet: Weakly and semi-supervised se-

mantic image segmentation using stochastic inference. In

Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, pages 5267–5276, 2019.

[32] Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and

Yun Fu. Tell me where to look: Guided attention inference

network. 2018 IEEE/CVF Conference on Computer Vision

and Pattern Recognition, Jun 2018.

[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,

Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence

Zitnick. Microsoft coco: Common objects in context. Lec-

ture Notes in Computer Science, page 740–755, 2014.

[34] Bin Liu, Zhirong Wu, Han Hu, and Stephen Lin. Deep metric

transfer for label propagation with limited annotated data.

arXiv preprint arXiv:1812.08781, 2018.

[35] John McCormac, Ankur Handa, Stefan Leutenegger, and

Andrew J Davison. Scenenet rgb-d: Can 5m synthetic im-

ages beat generic imagenet pre-training on indoor segmenta-

tion? In Proceedings of the IEEE International Conference

on Computer Vision, pages 2678–2687, 2017.

[36] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori

Koyama. Virtual adversarial training: a regularization

method for supervised and semi-supervised learning. IEEE

transactions on pattern analysis and machine intelligence,

2018.

[37] Aude Oliva and Antonio Torralba. The role of context in

object recognition. Trends in cognitive sciences, 11(12):520–

527, 2007.

[38] George Papandreou, Liang-Chieh Chen, Kevin P Murphy,

and Alan L Yuille. Weakly-and semi-supervised learning of

a deep convolutional network for semantic image segmenta-

tion. In Proceedings of the IEEE international conference on

computer vision, pages 1742–1750, 2015.

[39] Adam Paszke, Sam Gross, Soumith Chintala, Gregory

Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al-

ban Desmaison, Luca Antiga, and Adam Lerer. Automatic

differentiation in pytorch. 2017.

[40] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama

Chellappa. Visual domain adaptation: A survey of recent

advances. IEEE signal processing magazine, 32(3):53–69,

2015.

[41] Tobias Pohlen, Alexander Hermans, Markus Mathias, and

Bastian Leibe. Full-resolution residual networks for seman-

tic segmentation in street scenes. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition,

pages 4151–4160, 2017.

[42] Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, and Alan

Yuille. Deep co-training for semi-supervised image recogni-

tion. In Proceedings of the European Conference on Com-

puter Vision (ECCV), pages 135–152, 2018.

[43] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri

Valpola, and Tapani Raiko. Semi-supervised learning with

ladder networks. In Advances in neural information process-

ing systems, pages 3546–3554, 2015.

[44] Fatemeh Sadat Saleh, Mohammad Sadegh Aliakbarian,

Mathieu Salzmann, Lars Petersson, and Jose M Alvarez. Ef-

fective use of synthetic data for urban scene semantic seg-

mentation. In European Conference on Computer Vision,

pages 86–103. Springer, 2018.

[45] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz,

Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan

Wang. Real-time single image and video super-resolution

using an efficient sub-pixel convolutional neural network. In

Proceedings of the IEEE conference on computer vision and

pattern recognition, pages 1874–1883, 2016.

[46] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek:

Forcing a network to be meticulous for weakly-supervised

object and action localization. In 2017 IEEE International

Conference on Computer Vision (ICCV), pages 3544–3553.

IEEE, 2017.

[47] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang

Wang. Box-driven class-wise region masking and filling rate

guided loss for weakly supervised semantic segmentation,

2019.

[48] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao.

Sun rgb-d: A rgb-d scene understanding benchmark suite. In

Proceedings of the IEEE conference on computer vision and

pattern recognition, pages 567–576, 2015.

[49] Nasim Souly, Concetto Spampinato, and Mubarak Shah.

Semi supervised semantic segmentation using generative ad-

versarial network. In Proceedings of the IEEE International

Conference on Computer Vision, pages 5688–5696, 2017.

[50] Antti Tarvainen and Harri Valpola. Mean teachers are better

role models: Weight-averaged consistency targets improve

semi-supervised deep learning results, 2017.

[51] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun,

and Christoph Bregler. Efficient object localization using

convolutional networks. In Proceedings of the IEEE Con-

ference on Computer Vision and Pattern Recognition, pages

648–656, 2015.

[52] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming

Cheng, Yao Zhao, and Shuicheng Yan. Object region mining

with adversarial erasing: A simple classification to semantic

segmentation approach. In Proceedings of the IEEE con-

ference on computer vision and pattern recognition, pages

1568–1576, 2017.

[53] Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi

Feng, and Thomas S. Huang. Revisiting dilated convolution:

A simple approach for weakly- and semi-supervised seman-

tic segmentation. 2018 IEEE/CVF Conference on Computer

Vision and Pattern Recognition, Jun 2018.

[54] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong,

and Quoc V Le. Unsupervised data augmentation. arXiv

preprint arXiv:1904.12848, 2019.

[55] Ansheng You, Xiangtai Li, Zhen Zhu, and Yunhai Tong.

Torchcv: A pytorch-based framework for deep learning in

computer vision. https://github.com/donnyyou/

torchcv, 2019.


[56] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and

David Lopez-Paz. mixup: Beyond empirical risk minimiza-

tion. arXiv preprint arXiv:1710.09412, 2017.

[57] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and

Thomas S Huang. Adversarial complementary learning for

weakly supervised object localization. In Proceedings of the

IEEE Conference on Computer Vision and Pattern Recogni-

tion, pages 1325–1334, 2018.

[58] Yang Zhang, Philip David, and Boqing Gong. Curricu-

lum domain adaptation for semantic segmentation of urban

scenes. In Proceedings of the IEEE International Conference

on Computer Vision, pages 2020–2030, 2017.

[59] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang

Wang, and Jiaya Jia. Pyramid scene parsing network. In

Proceedings of the IEEE conference on computer vision and

pattern recognition, pages 2881–2890, 2017.

[60] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-

view learning overview: Recent progress and new chal-

lenges. Information Fusion, 38:43–54, 2017.

[61] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva,

and Antonio Torralba. Learning deep features for discrim-

inative localization. 2016 IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), Jun 2016.
