
Shuffle Augmentation of Features from Unlabeled Data for Unsupervised Domain Adaptation

Changwei Xu∗, Xpeng motors
Jianfei Yang∗, Nanyang Technological University
Haoran Tang, University of Pennsylvania
Han Zou, Microsoft
Cheng Lu, Tianshuo Zhang†, Xpeng motors

Abstract

Unsupervised Domain Adaptation (UDA), a branch of transfer learning where labels for target samples are unavailable, has been widely researched and developed in recent years with the help of adversarially trained models. Although existing UDA algorithms are able to guide neural networks to extract transferable and discriminative features, classifiers are trained only under the supervision of labeled source data. Given the inevitable discrepancy between source and target domains, the classifiers can hardly be aware of the target classification boundaries. In this paper, Shuffle Augmentation of Features (SAF), a novel UDA framework, is proposed to address the problem by providing the classifier with supervisory signals from target feature representations. SAF learns from the target samples, adaptively distills class-aware target features, and implicitly guides the classifier to find comprehensive class borders. As demonstrated by extensive experiments, the SAF module can be integrated into any existing adversarial UDA model to achieve performance improvements.

1. Introduction

Transfer learning algorithms usually learn class-aware information from sufficiently labeled source data and finetune on the target domain to transfer semantic representations. The crux of transfer learning is the unavoidable discrepancy between the source and target domains, leading to the essential objective of eliminating such inconsistency. Unsupervised Domain Adaptation (UDA) is a subtopic in transfer learning where target labels are unavailable, meaning that learning domain discrepancy between datasets is even more challenging compared with its supervised counterpart.

Figure 1. Illustration of the Shuffle Augmentation of Features (SAF) algorithm. Panels: (a) source only; (b) SAF assisted. Best viewed in color. Squares and circles indicate different classes. Blue shapes denote source samples, gray shapes denote unrecognized target samples, and green shapes are target samples selected by SAF. (a) Result of a classifier trained only with labeled source data. Although the classifier learns the optimal boundary for the source distribution, many target samples are still misclassified due to the domain discrepancy. (b) Result of an SAF-improved classifier trained with both source and composite target features selected by SAF. Supervised by signals from both source and target domains, the classifier is able to learn a more comprehensive class boundary.

∗These authors contributed equally. †Work done while at Xpeng motors.

The Domain Adaptation (DA) problem is formally introduced by [2], where the fundamental concept of the discrete $\mathcal{H}\Delta\mathcal{H}$-divergence is established. Various methods, based on deep CNN backbones [17, 18, 23], are proposed to tackle UDA tasks by reducing domain discrepancy with explicit regularizations [30, 48, 51]. Inspired by the Generative Adversarial Network (GAN) [13], the Domain Adversarial Neural Network (DANN) [10] extracts domain-invariant features with an adversarial approach and inspires other adversarial UDA methods [33, 43, 50]. Margin Disparity Discrepancy (MDD) [61] further generalizes the $\mathcal{H}\Delta\mathcal{H}$-divergence by providing a continuous discrepancy metric.

arXiv:2201.11963v1 [cs.CV] 28 Jan 2022

Although UDA training strategies have been proven to be capable of guiding CNNs to extract transferable as well as categorical features, only labeled source features are utilized for supervised training. In the situation shown in Figure 1a, features of different classes are clearly separated, but the optimal classification boundary guided by source-only supervision misclassifies numerous target samples. Hence, expanding the supervised training datasets with target features is necessary for further improvements.

Noisy Label (NL) learning aims to train models in the presence of incorrect labels, which is similar to the scenario of target samples with noisy pseudo-labels in UDA. Despite major differences between the two settings, employing NL schemes to assist UDA training in selecting creditable target features is intuitively viable.

In this paper, we propose Shuffle Augmentation of Features (SAF), a novel UDA framework. Inspired by NL algorithms, SAF learns the distribution of target features, assigns a credibility weight to each entry, and guides the classifier to learn comprehensive class boundaries, as illustrated in Figure 1b. Moreover, SAF can be integrated into any existing adversarial UDA model, enhancing its performance. To summarize, our contributions are:

• We analyze the theoretical background of the UDA problem, reveal the shortcomings of UDA methods with source-only supervision, and discover a novel solution on the basis of cutting-edge DA theories. Specifically, we tackle the issues by adding target samples to supervisory signals. To the best of our knowledge, we are the first to distill class-variant information from the entire target distribution.

• We assimilate ideas from NL methods, adapt procedural details to compensate for major differences between NL and UDA settings, and integrate their essence into the design of UDA algorithms.

• We propose the Shuffle Augmentation of Features (SAF), which can be built upon any adversarial UDA method. SAF learns from target domain distributions, assigns reliability weights to pseudo-labeled samples, and augments the supervisory training signal.

• We demonstrate the superiority of the SAF algorithm with extensive experiments on benchmark datasets, where the proposed method outperforms existing UDA models and presents state-of-the-art performance.

2. Related Work

2.1. Unsupervised Domain Adaptation

The fundamental theory for Domain Adaptation (DA) is formally introduced by [2, 3], where the $\mathcal{H}\Delta\mathcal{H}$-divergence along with the estimation formula for the error boundary on the target distribution are formalized. Based on the above theories, researchers combine deep CNN [24, 40] backbones [17, 18, 23] with statistical measurements [29, 30, 32, 45, 48, 51, 58] to shrink the gap between source and target features. Inspired by the Generative Adversarial Network (GAN) [13], [10, 11] propose the Domain Adversarial Neural Network (DANN), which is the first method employing an adversary player to implicitly align domain features. The adversarial design of DANN further inspires numerous algorithms [26, 31, 33, 35, 42, 43, 50].

Nevertheless, [5] reveals that the discriminability of the features extracted by DANN actually decreases at the same time. [61] ascribes the phenomenon to the limitations of the $\mathcal{H}\Delta\mathcal{H}$-divergence and proposes the Margin Disparity Discrepancy (MDD), an alternative inconsistency metric generalizable to multi-way classifiers, to estimate the $\mathcal{H}\Delta\mathcal{H}$-divergence.

Moreover, clustering-based methods [8, 20] assign virtual labels to target data and force the network to follow the cluster assumption [14]. GAN-based methods [19, 27, 44] augment data by directly employing GANs, while mixup-based methods [34, 56, 64] combine existing samples with the mixup [59] technique.

2.2. Noisy Label Learning

Noisy Label (NL) learning aims to train models from massive but inaccurately labeled data in real-world scenarios, as clean datasets are usually expensive to obtain. Traditional NL algorithms address the issue by using statistical clustering techniques (e.g. bagging, boosting, k-nearest neighbor, etc.) to determine the reliability of each label [15, 47]. With the help of deep CNNs, the following approaches are proposed to tackle noisy data: assigning weights to samples w.r.t. their reliability [38, 54, 60]; selecting or creating clean samples [25, 46]; or dynamically renewing labels for mislabeled data [16, 63].

3. Preliminaries

Given an input space $\mathcal{X}$ and a label space $\mathcal{Y}$, a domain [2] is defined as a pair $\langle \mathcal{D}, f \rangle$, where $\mathcal{D}$ is a distribution on $\mathcal{X}$, and $f : \mathcal{X} \to \mathcal{Y}$ is a labeling function. In Unsupervised Domain Adaptation (UDA), there are two datasets: the labeled source dataset $(\mathcal{S}, Y_{\mathcal{S}})$, and the unlabeled target dataset $(\mathcal{T}, \emptyset)$, drawn from different domains $\langle \mathcal{D}_S, f_S \rangle$, $\langle \mathcal{D}_T, f_T \rangle$, respectively. Due to the discrepancy between the two domains, techniques capable of mining underlying statistical patterns from feature distributions are required.

Let $\mathcal{C}$ be the hypothesis space of classifiers that map from the input space $\mathcal{X}$ to $[0, 1]^{|\mathcal{Y}|}$. For an arbitrary $C \in \mathcal{C}$, $x \in \mathcal{X}$, and $y \in \mathcal{Y}$, denote $C_y(x)$ as the predicted probability of $x$ belonging to the class $y$, and denote $C(x)$ as the predicted label for $x$. Margin Disparity Discrepancy (MDD) [61] is a continuous metric for the divergence between two domains, which can be applied to design adversarial objectives beyond binary classifiers. Fixing the threshold $\varrho$, the empirical discrepancy $d^{(\varrho)}_{C,\mathcal{C}}$ for a classifier $C$ on the hypothesis space $\mathcal{C}$ w.r.t. datasets $\mathcal{S}, \mathcal{T}$ is defined as:

\[
d^{(\varrho)}_{C,\mathcal{C}}(\mathcal{S}, \mathcal{T}) \triangleq 2 \max_{C' \in \mathcal{C}} \left( \Delta^{(\varrho)}_{\mathcal{S}}(C, C') - \Delta^{(\varrho)}_{\mathcal{T}}(C, C') \right), \tag{1}
\]

where $\Delta^{(\varrho)}_{\mathcal{S}}$, $\Delta^{(\varrho)}_{\mathcal{T}}$ are the empirical margin disparities on datasets $\mathcal{S}, \mathcal{T}$, respectively (definitions can be found in the supplementary material).

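With probability at least $1 - 3\delta$ ($\delta > 0$), the risk for a classifier $C$ on the target domain $\langle \mathcal{D}_T, f_T \rangle$ satisfies the following inequality:

\[
\begin{aligned}
\varepsilon_{\mathcal{D}_T}(C) \le{} & \varepsilon^{(\varrho)}_{\mathcal{S}}(C) + d^{(\varrho)}_{C,\mathcal{C}}(\mathcal{S}, \mathcal{T}) + \lambda(\varrho, \mathcal{C}, \mathcal{D}_S, \mathcal{D}_T) \\
& + 2\sqrt{\frac{\log\frac{2}{\delta}}{2|\mathcal{S}|}} + \sqrt{\frac{\log\frac{2}{\delta}}{2|\mathcal{T}|}} + K\!\left(|\mathcal{Y}|, \frac{1}{\varrho}, \frac{1}{\sqrt{|\mathcal{S}|}}, \frac{1}{\sqrt{|\mathcal{T}|}}\right),
\end{aligned} \tag{2}
\]

where $\varepsilon^{(\varrho)}_{\mathcal{S}}(C)$ is the empirical source risk, $\lambda$ is a constant, and $K$ is a term positively related to the number of classes $|\mathcal{Y}|$ and negatively related to the threshold $\varrho$ and the sizes of both source and target datasets.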

However, following the definitions above, the concept of the hypothesis class $\mathcal{C}$ is questionable and not self-contained. For a fixed classifier structure, the hypothesis class $\mathcal{C}$ is invariable, but shrinking the discrepancy $d^{(\varrho)}_{C,\mathcal{C}}(\mathcal{S}, \mathcal{T})$ requires altering $\mathcal{C}$. In fact, another interpretation can be made to understand the theoretical background. Guided by various supervisory signals (e.g. loss criteria, regularizations), the process of optimizing CNNs can be regarded as squeezing the hypothesis class. Supervision aims to lead CNNs to the optimized neighborhood $\mathcal{C}^*$ where the local discrepancy $d^{(\varrho)}_{C,\mathcal{C}^*}(\mathcal{S}, \mathcal{T})$ is the minimum across the universe:

\[
\mathcal{C}^* := \arg\min_{\mathcal{C}' \subset \mathcal{C}} \left[ \max_{C \in \mathcal{C}'} d^{(\varrho)}_{C,\mathcal{C}'}(\mathcal{S}, \mathcal{T}) \right]. \tag{3}
\]

In the rest of this paper, $\mathcal{C}$ denotes the hypothesis neighborhood of the current classifier $C$ guided by training strategies.

4. Shuffle Augmentation of Features

We propose the Shuffle Augmentation of Features (SAF) algorithm for the UDA problem.

4.1. Motivations

UDA is a special branch in the field of transfer learning, as there are no explicit clues about the target distribution or the differences between domains $\mathcal{D}_S$ and $\mathcal{D}_T$. Existing state-of-the-art models [11, 61, 64] claim that they are capable of extracting features with decent transferability and discriminability from both source and target samples, but few of these algorithms utilize the target features for the training of the classifier $C$, yielding the suboptimal situation illustrated in Figure 1a, where the optimal boundary for the source distribution misclassifies numerous target samples. Therefore, valuable target features should be exploited to refine the classification boundaries, as shown in Figure 1b.

Supervised training requires labels, so estimated labels for the target samples are needed. This introduces another intractable issue: since the pseudo-label estimation cannot be completely accurate [20, 64], the network inevitably faces noisy labels. As the model needs to learn from correct labels and diminish the negative influence of incorrect labels, noise-robust algorithms are required to guarantee convergence. Intuitively, NL learning [15, 47] is the natural field to draw on, since noise-robust algorithms have been widely studied and deeply developed there.

4.2. Gaps Between NL and UDA

Although NL algorithms can effectively extract features with trustworthy labels from a noisy dataset, it is impractical to naively transfer NL methods to UDA due to three distinctions between the two settings:

1. In NL learning, all samples are from a unified domain distribution, but source and target distributions are inconsistent in the UDA setting.

2. Popular NL benchmark datasets [4, 28, 57] often contain massive numbers of samples, which favor pattern mining and clustering NL algorithms. By contrast, pervasive UDA benchmark datasets [12, 41, 53] offer only thousands of samples or fewer per domain, where CNNs are more likely to overfit.

3. In the NL setting, the noise ratio is usually within the range of 8% to 38.5% [47]; but in UDA, source labels are always clean and there is no guaranteed maximal noise ratio for virtually labeled target samples.

The above differences between the two settings can potentially jeopardize the fitting of NL training schemes if they are transferred to the UDA setting without reasonable adaptations.

Figure 2. The SAF framework, best viewed in color. Built upon an adversarial UDA model consisting of a feature extractor $F$, a classifier $C$, and an adversarial module $D$, SAF further adds an SAF-mixup module and a bottleneck layer $B$. SAF randomly draws pairs of target features and adaptively generates mixup weights based on target distributions. The combined target features $\tilde{\mathcal{F}}_T$ and corresponding pseudo-labels are used for augmentation of supervisory signals, enabling the model to discover better classification boundaries for target samples.

4.3. Algorithm and Training Objectives

Tailored to the characteristics of the UDA setting, the SAF algorithm picks and incorporates the ideas of multiple NL methods [16, 38, 60]. The SAF framework is built upon an adversarial UDA backbone with three main components: the feature generator $F$, the classifier $C$, and the adversarial module $D$. In addition, an SAF module and a bottleneck layer $B$ are added. The SAF framework is shown in Figure 2.

Source Label Prediction Objective. Following inequality (2) for the error boundary, minimizing the source classification loss is necessary for decreasing the target risk. Since the bottleneck layer $B$ is added to regularize feature representations and takes part in prediction tasks, the training objective of source label prediction becomes:

\[
(F^*, B^*, C^*) \triangleq \arg\min_{F,B,C} \mathcal{L}_C\big([C \circ B \circ F](\mathcal{S}), Y_{\mathcal{S}}\big), \tag{4}
\]

where $\mathcal{L}_C$ is the classification criterion, usually the cross-entropy loss.

Adversarial Domain Adaptation Objective. The domain adversarial loss $\mathcal{L}_D$ indicates the discrepancy between feature distributions $\mathcal{F}_S$ and $\mathcal{F}_T$. With the help of the Gradient Reversal Layer (GRL) [10], $\mathcal{L}_D$ guides the adversary module $D$ to regularize the extractor $F$ toward features with better transferability.

In the proposed SAF framework, $(F, B)$ and $D$ play a mini-max game to attain the optimum. In general, the domain adversarial objectives can be written as:

\[
(F^*, B^*) \triangleq \arg\min_{F,B} \lambda_D \mathcal{L}_D(F, B, D^*, C^*, \mathcal{S}, \mathcal{T}), \tag{5}
\]

\[
D^* \triangleq \arg\max_{D} \mathcal{L}_D(F^*, B^*, D, C^*, \mathcal{S}, \mathcal{T}), \tag{6}
\]

where $\lambda_D$ is the GRL weight. The formula of $\mathcal{L}_D$ depends entirely on the design of the adversarial module $D$, which is determined by the selected UDA backbone.
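The role of the GRL weight in the mini-max objectives can be illustrated with a toy sketch. It assumes the usual GRL behavior described in [10]: identity in the forward pass, and gradients multiplied by a negative weight in the backward pass, so that the extractor receives the reversed gradient of the adversary's loss. The class name and the list-based "tensors" are hypothetical and only serve the illustration; a real implementation would hook into an autograd framework.

```python
class GradientReversal:
    """Toy gradient reversal layer operating on plain Python lists."""

    def __init__(self, weight):
        # weight plays the role of the GRL weight (lambda_D in the text)
        self.weight = weight

    def forward(self, features):
        # Forward pass: features go through unchanged
        return list(features)

    def backward(self, grad_output):
        # Backward pass: flip the sign and scale by the GRL weight
        return [-self.weight * g for g in grad_output]


if __name__ == "__main__":
    grl = GradientReversal(weight=0.5)
    print(grl.forward([1.0, -2.0]))
    print(grl.backward([0.2, -0.4]))
```

With this layer placed between the feature path and the adversarial module, a single backward pass lets $D$ follow its own gradient while $(F, B)$ follow the reversed one, which is how the two objectives are usually trained jointly.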

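For instance, if the MDD framework [61] were chosen as the backbone, $\mathcal{L}_D$ would be:

\[
\begin{aligned}
\mathcal{L}_D(F, B, D, C, \mathcal{S}, \mathcal{T}) := {} & \mathbb{E}_{x \sim \mathcal{T}}\Big[\mathcal{L}_{NLL}\big(1 - \sigma([D \circ B \circ F](x))\big)\Big] \\
& - e^{\varrho} \cdot \mathbb{E}_{x \sim \mathcal{S}}\Big[-\mathcal{L}_{NLL}\big(\sigma([C \circ B \circ F](x))\big)\Big],
\end{aligned} \tag{7}
\]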

where $\mathcal{L}_{NLL}$ is the negative log likelihood loss, and $\sigma$ is the softmax function.

SAF-Supervision Objective. The SAF module (denoted as $M$) distills valuable target features via SAF-mixup, a novel mixup variant, to train the classifier $C$. As shown in Figure 2, the SAF module consists of two parallel bottlenecks $S_1$, $S_2$, a weight estimator $S_\eta$, and two operation gates: a matrix addition gate $\oplus$ and a linear combination gate $\otimes$.

The motivation for using two independent SAF bottlenecks is straightforward: two NNs are able to extract diversified knowledge on the selection of valuable features, and they can learn from each other through backpropagation signals, which provides exceptional efficacy in training. In contrast, a single network not only lacks the diversity of a pair, but also feeds simplified information to $S_\eta$, impairing the effectiveness of the SAF module. The comparison is also investigated in the ablation study (Table 4).

Algorithm 1: SAF-mixup
Module: bottleneck $B$, classifier $C$, SAF bottlenecks $S_1$, $S_2$, SAF weight estimator $S_\eta$
Input: the set of target features $\mathcal{F}_T$
Output: augmented features $(\tilde{\mathcal{F}}_T, \tilde{Y}_T)$
  $\tilde{\mathcal{F}}_T \leftarrow \emptyset$; $\tilde{Y}_T \leftarrow \emptyset$
  while $\mathcal{F}_T \neq \emptyset$ do
    $\phi_1, \phi_2 \leftarrow$ RandomDrawPair($\mathcal{F}_T$)
    $\eta \leftarrow S_\eta\big(S_1(\phi_1) + S_2(\phi_2)\big)$
    $\phi \leftarrow \eta\phi_1 + (1 - \eta)\phi_2$
    $y_1 \leftarrow C \circ B(\phi_1)$; $y_2 \leftarrow C \circ B(\phi_2)$
    $y \leftarrow \eta y_1 + (1 - \eta) y_2$
    $\tilde{\mathcal{F}}_T \leftarrow \tilde{\mathcal{F}}_T \cup \{\phi\}$; $\tilde{Y}_T \leftarrow \tilde{Y}_T \cup \{y\}$
  end while
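The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: pairs are drawn without replacement and an unpaired leftover feature is dropped, the weight estimator outputs a single scalar, and the helper names (`linear`, `make_fc_relu`, `make_fc_sigmoid`, `saf_mixup`) as well as all weights are hypothetical. The `predict` argument stands in for the composition of the bottleneck $B$ and classifier $C$ followed by a softmax.

```python
import math
import random


def linear(weights, bias, x):
    """Affine map with one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_fc_relu(weights, bias):
    """FC-ReLU block, the structure the paper states for the SAF bottlenecks."""
    return lambda x: [max(0.0, v) for v in linear(weights, bias, x)]


def make_fc_sigmoid(weights, bias):
    """FC-Sigmoid block with a scalar output, standing in for the weight estimator."""
    def estimator(x):
        z = sum(w * v for w, v in zip(weights, x)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return estimator


def saf_mixup(target_features, s1, s2, s_eta, predict, rng):
    """Mix randomly drawn pairs of target features and their pseudo-labels."""
    pool = list(target_features)
    rng.shuffle(pool)                  # random pairing (assumed: without replacement)
    mixed_features, mixed_labels = [], []
    while len(pool) >= 2:              # an unpaired leftover is dropped (assumption)
        phi1, phi2 = pool.pop(), pool.pop()
        gate_in = [a + b for a, b in zip(s1(phi1), s2(phi2))]  # addition gate
        eta = s_eta(gate_in)           # mixup coefficient in (0, 1)
        phi = [eta * a + (1.0 - eta) * b for a, b in zip(phi1, phi2)]
        y1, y2 = predict(phi1), predict(phi2)                  # pseudo-labels
        y = [eta * a + (1.0 - eta) * b for a, b in zip(y1, y2)]
        mixed_features.append(phi)
        mixed_labels.append(y)
    return mixed_features, mixed_labels


if __name__ == "__main__":
    rng = random.Random(0)
    feats = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
    s1 = make_fc_relu([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])
    s2 = make_fc_relu([[-0.4, 0.6, 0.2], [0.7, -0.1, 0.3]], [0.1, 0.0])
    s_eta = make_fc_sigmoid([1.0, -1.0], 0.0)
    predict = lambda phi: softmax(
        linear([[1.0, 0.0, -1.0], [-1.0, 0.5, 1.0]], [0.0, 0.0], phi))
    mixed, labels = saf_mixup(feats, s1, s2, s_eta, predict, rng)
    print(len(mixed), [round(sum(y), 6) for y in labels])
```

Because each mixed label is a convex combination of two probability vectors, it remains a valid probability vector, which is what allows it to serve as a soft training target for the classifier.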

In SAF-mixup, target features extracted by $F$ are randomly selected as pairs and fed in parallel into the SAF bottlenecks. Then the sum of the filtered features is forwarded into $S_\eta$, where the mixup coefficient $\eta$ is determined according to the relative confidence of the paired features. Afterwards, target features $\mathcal{F}_T$ and corresponding pseudo-labels are linearly combined with the mixup coefficients to yield the augmentation dataset $(\tilde{\mathcal{F}}_T, \tilde{Y}_T)$. As the augmented target features are forwarded into the bottleneck $B$ and the classifier $C$, the supervisory loss signals can be calculated and propagated to the network structures via backpropagation. The SAF-supervision loss is calculated using the output logits and the corresponding composite labels $\tilde{Y}_T$:

\[
\mathcal{L}_M(\tilde{\mathcal{F}}_T, \tilde{Y}_T) := \mathcal{L}_{CED}\big([C \circ B](\tilde{\mathcal{F}}_T), \tilde{Y}_T\big), \tag{8}
\]

where $\mathcal{L}_{CED}$ is the cross-entropy divergence (defined in the supplementary material).
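Since the exact form of the cross-entropy divergence is given only in the supplementary material, the following sketch substitutes a standard soft-label cross-entropy, $-\sum_k y_k \log p_k$ with $p$ the softmax of the logits, as an assumed stand-in. It shows how composite (non one-hot) labels can supervise the classifier output; the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def soft_cross_entropy(logits, soft_label):
    """Cross-entropy between a soft label and softmax(logits)."""
    return -sum(y * lp for y, lp in zip(soft_label, log_softmax(logits)))


def saf_supervision_loss(batch_logits, batch_soft_labels):
    """Mean soft-label cross-entropy over a batch of mixed features."""
    losses = [soft_cross_entropy(l, y)
              for l, y in zip(batch_logits, batch_soft_labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
    labels = [[0.7, 0.3, 0.0], [0.2, 0.5, 0.3]]
    print(saf_supervision_loss(logits, labels))
```

With one-hot labels this reduces to the ordinary cross-entropy loss, so the same criterion covers both clean source labels and mixed target labels.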

Accordingly, the SAF-supervision objective can be described as follows:

\[
(F^*, M^*, B^*, C^*) \triangleq \arg\min_{F,M,B,C} \lambda_M \mathcal{L}_M(\tilde{\mathcal{F}}_T, \tilde{Y}_T), \tag{9}
\]

where $\lambda_M$ is the SAF weight.

In the SAF framework, $M$ is an adaptive Multi-Layer Perceptron (MLP) trained simultaneously with the other prediction modules, as $M$ is adjusted to yield reliable mixup coefficients. Since $F, B, C$ are trained with comparatively ample data samples, beneficial mutations of $M$ will be positively amplified, while harmful gradients of $M$ that negatively affect the fitting can be amended and rectified.

The SAF module makes up for the deficiencies of UDA models with source-only supervision. With the help of SAF, models are able to learn both source and target distributions and to perceive more accurate boundaries for target samples.

The Bottleneck Layer. As $C$ is only trained with source features, target feature fragments generated by SAF would be treated as noise and little meaningful information about the target distributions could be obtained. To address this issue, we propose the bottleneck $B$, which is able to reshape $\tilde{\mathcal{F}}_T$ to match the subsequent classifier $C$, because $B$ learns from both source and target distributions. The necessity of the bottleneck in the SAF training scheme is also verified quantitatively in Section 5.3.

4.4. Theoretical Insight

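We provide a theoretical justification of how the proposed SAF squeezes the error boundary according to the margin theories [61], and explain why SAF effectively seeks a novel weight-assigning strategy for noisy label learning. More details are given in the Appendix. Let $C$ be the hypothesis we are interested in and denote $\mathcal{C}$ as its hypothesis neighborhood. Since $C$ digests feature representations $\mathcal{F}_S$ and $\mathcal{F}_T$, the error boundary in inequality (2) can be rewritten as:

\[
\begin{aligned}
\varepsilon_{\mathcal{F}_T}(C) \le{} & \varepsilon^{(\varrho)}_{\mathcal{F}_S}(C) + d^{(\varrho)}_{C,\mathcal{C}}(\mathcal{F}_S, \mathcal{F}_T) + \lambda(\varrho, \mathcal{C}, \mathcal{D}_S, \mathcal{D}_T) \\
& + 2\sqrt{\frac{\log\frac{2}{\delta}}{2|\mathcal{F}_S|}} + \sqrt{\frac{\log\frac{2}{\delta}}{2|\mathcal{F}_T|}} + K\!\left(|\mathcal{Y}|, \frac{1}{\varrho}, \frac{1}{\sqrt{|\mathcal{F}_S|}}, \frac{1}{\sqrt{|\mathcal{F}_T|}}\right).
\end{aligned} \tag{10}
\]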

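As explained previously, the SAF algorithm generates augmented target features and expands the supervisory dataset, which is equivalent to enlarging the source sample space with $\tilde{\mathcal{F}}_T$ and increasing the source sample size by $|\tilde{\mathcal{F}}_T|$. Since the sum of the source-size-dependent terms in formula (10) decreases as the sample size of the source features $|\mathcal{F}_S|$ increases, the following inequality holds:

\[
\begin{aligned}
& 2\sqrt{\frac{\log\frac{2}{\delta}}{2|\mathcal{F}_S \cup \tilde{\mathcal{F}}_T|}} + K\!\left(|\mathcal{Y}|, \frac{1}{\varrho}, \frac{1}{\sqrt{|\mathcal{F}_S \cup \tilde{\mathcal{F}}_T|}}, \frac{1}{\sqrt{|\mathcal{F}_T|}}\right) \\
& \quad < 2\sqrt{\frac{\log\frac{2}{\delta}}{2|\mathcal{F}_S|}} + K\!\left(|\mathcal{Y}|, \frac{1}{\varrho}, \frac{1}{\sqrt{|\mathcal{F}_S|}}, \frac{1}{\sqrt{|\mathcal{F}_T|}}\right),
\end{aligned} \tag{11}
\]

showing that SAF is capable of squeezing the error boundary on the basis of margin theories.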
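As a small numerical illustration of this argument, the sketch below evaluates only the explicit square-root term $2\sqrt{\log(2/\delta) / (2n)}$ before and after the supervised set grows from $n$ to $n + m$ samples. The term $K$ is not reproduced because its closed form is not given in the main text, and the sizes and confidence level used here are hypothetical.

```python
import math


def source_deviation_term(n_source, delta):
    """2 * sqrt(log(2/delta) / (2 * n_source)), the source-size term of the bound."""
    return 2.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n_source))


def shrink_ratio(n_source, n_augmented):
    """Ratio of the term after vs. before augmentation: sqrt(n / (n + m))."""
    return math.sqrt(n_source / (n_source + n_augmented))


if __name__ == "__main__":
    # Hypothetical sizes and confidence level, chosen only for illustration
    n_source, n_augmented, delta = 2000, 1000, 0.05
    before = source_deviation_term(n_source, delta)
    after = source_deviation_term(n_source + n_augmented, delta)
    print(before, after, after / before, shrink_ratio(n_source, n_augmented))
```

The ratio depends only on the relative growth of the supervised set, so the reduction is most noticeable when the number of augmented features is comparable to the number of source features.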

Moreover, Active Learning theories [6] and applications in deep CNNs [1, 9, 21] have shown that learning from data with uncertain labels dramatically improves the performance of deep CNNs. Using the conditional entropy $H\big(C(x)\big) = -\sum_{y \in \mathcal{Y}} C_y(x) \log C_y(x)$ as the uncertainty criterion, samples with ambiguous predictions should be reused to reinforce the model fitting. On the one hand, reliable (low-entropy) target samples, usually close to existing source clusters, intensify the biases from the source distribution instead of guiding the model to transfer semantic expressions. On the other hand, pseudo-labels for unreliable samples are more likely to be incorrect and noisy, which also degrades the model performance.
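The uncertainty criterion can be made concrete with a short sketch, assuming the usual Shannon entropy with natural logarithm and the convention $0 \log 0 = 0$. The function names and the example probabilities are hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rank_by_uncertainty(batch_probs):
    """Indices of samples sorted from most to least uncertain."""
    return sorted(range(len(batch_probs)),
                  key=lambda i: prediction_entropy(batch_probs[i]),
                  reverse=True)


if __name__ == "__main__":
    batch = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
    print([round(prediction_entropy(p), 4) for p in batch])
    print(rank_by_uncertainty(batch))
```

A uniform prediction attains the maximum entropy $\log |\mathcal{Y}|$ and a one-hot prediction attains zero, which matches the reading of low-entropy samples as reliable and high-entropy samples as ambiguous.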

Therefore, the shuffle mixup among all target samples is utilized to address this dilemma. The diversity extracted from high-entropy samples is learnt by the network to fit the target distributions, while reliable samples offer trustworthy information to rectify the influence of noisy gradients. The SAF module learns from the backpropagation signals to seek the optimal weight-assigning strategy that balances the biases brought by the low-entropy samples against the noise from high-entropy samples. The visualization in Figure 3 demonstrates the ability of SAF to cluster noisy target samples while simultaneously maintaining clear class boundaries.

4.5. Comparisons to Other Mixup Methods

Virtual Mixup Training (VMT) [34] employs mixup on raw inputs to impose a Local Lipschitzness (LL) constraint across the input space, for the enhancement of classifier training. Dual Mixup Regularized Learning (DMRL) [56] further improves VMT by applying the mixup regularization on the domain classifier. Nevertheless, neither VMT nor DMRL distills classification-related information from target samples while pursuing LL, leaving the condition shown in Figure 1a unsolved. E-MixNet [64] mixes each reliable target sample with a distant source sample. However, as E-MixNet utilizes only target entries close to the existing source clusters, it disobeys the principle of active learning, and is unable to learn comprehensive class-variant information from the whole target distribution.

By comparison, among all mixup-based UDA methods, only SAF can extract class-variant information from the entire target space. Whereas the above models combine samples with random numbers (VMT, DMRL) or constants (E-MixNet), SAF adaptively determines the weights using neural networks. Our experimental results (Table 4) show that the performance of the SAF framework is impaired if the mixup module is replaced by the weight generators of the above methods. This fact demonstrates that SAF is a unique structure that cannot be replaced by the mixup components of other existing mixup-based UDA models.

5. Experiments

The proposed SAF framework is evaluated on three benchmark datasets against existing state-of-the-art UDA methods. The source code of SAF is available in the appendix.

5.1. Setup

Datasets. We evaluate our model on three popular DA benchmark datasets: Office-31 [41], Office-Home [53], and VisDA2017 [37]. Office-31 has three domains (Amazon, DSLR, Webcam) of 31 unbalanced classes, containing 4,652 images. Office-Home has four visually distinct domains (Art, Clipart, Product, Real world) with 65 classes, providing 15,500 images in total. VisDA2017 is a large dataset with 12 classes in two completely different domains: the training domain includes over 150k images of synthetic renderings of 3D models, and the validation domain has approximately 55k real-world images.

Implementation. The SAF bottleneck structure is FC-ReLU, and the SAF weight estimator is FC-Sigmoid. Other implementation details can be found in the supplementary material. SGD [39] with Nesterov [49] momentum 0.9 is used as the optimizer. Initially set to 0.004, the learning rate of $B, C, M$ is 10 times that of the pre-trained $F$.

Baselines. We compare the SAF algorithm with state-of-the-art UDA models, e.g. DANN [10], CDAN [31], BSP [5], CCC-GAN [27], etc. Other mixup-based methods (VMT [34], DMRL [56], E-MixNet [64]) are also chosen as baselines, but only the latest E-MixNet is displayed because the earlier methods neither have results on the benchmarks we choose nor provide reproducible code. Moreover, as the experimental SAF model is implemented on MDD [61] for optimized performance, models built upon MDD (ImA [20], E-MixNet [64]) are also involved for comparison. The commonly used full-training protocol [10] is employed for all experiments. ResNet-50 pre-trained on ImageNet is adopted as the feature extractor, and MDD [61] is chosen as the adversarial backbone to implement the SAF algorithm. All experiments are repeated five times with different random seeds, and the average results are reported.
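The optimizer settings above can be written down as a small configuration sketch. It reflects one reading of the text, namely that 0.004 is the initial learning rate of $B$, $C$, $M$ and that the pre-trained $F$ therefore starts at one tenth of it; the dictionary layout and the function name are hypothetical and not tied to any particular framework. The learning rate of the adversarial module $D$ is not stated in this section and is left out.

```python
def build_optimizer_config(head_lr=0.004, backbone_scale=0.1, momentum=0.9):
    """Parameter groups for SGD with Nesterov momentum (assumed layout)."""
    return {
        "optimizer": "SGD",
        "momentum": momentum,
        "nesterov": True,
        "param_groups": [
            # Pre-trained feature extractor F: 10x smaller learning rate
            {"modules": ["F"], "lr": head_lr * backbone_scale},
            # Newly initialized bottleneck B, classifier C, and SAF module M
            {"modules": ["B", "C", "M"], "lr": head_lr},
        ],
    }


if __name__ == "__main__":
    config = build_optimizer_config()
    for group in config["param_groups"]:
        print(group["modules"], group["lr"])
```

Using a smaller rate for the pre-trained extractor is a common fine-tuning choice, since the newly added modules start from random initialization and need larger updates.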

5.2. Results

The results on Office-31 tasks are shown in Table 1. The MDD+SAF framework attains the highest accuracies on three tasks and the highest average accuracy among all contestants. As for the Office-Home results displayed in Table 2, MDD+SAF outperforms all baselines on 8 out of 12 tasks and also achieves the highest average accuracy. Table 3 presents the accuracies for MDD+SAF and multiple baseline models on the VisDA2017 benchmark and again, our proposed model achieves the overall best performance.

| Method | A → W | D → W | W → D | A → D | D → A | W → A | Avg |
|---|---|---|---|---|---|---|---|
| Source only | 68.4±0.2 | 96.7±0.1 | 99.3±0.1 | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1 |
| DAN [30] | 80.5±0.4 | 97.1±0.2 | 99.6±0.1 | 78.6±0.2 | 63.6±0.3 | 62.8±0.2 | 80.4 |
| DANN [10] | 82.0±0.4 | 96.9±0.2 | 99.1±0.1 | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2 |
| ADDA [50] | 86.2±0.5 | 96.2±0.3 | 98.4±0.3 | 77.8±0.3 | 69.5±0.4 | 68.9±0.5 | 82.9 |
| JAN [33] | 86.0±0.4 | 96.7±0.3 | 99.7±0.1 | 85.1±0.4 | 69.2±0.3 | 70.7±0.5 | 84.6 |
| MADA [35] | 90.0±0.1 | 97.4±0.1 | 99.6±0.1 | 87.8±0.2 | 70.3±0.3 | 66.4±0.3 | 85.2 |
| GTA [44] | 89.5±0.5 | 97.9±0.3 | 99.8±0.4 | 87.7±0.5 | 72.8±0.3 | 71.4±0.4 | 86.5 |
| MCD [43] | 89.6±0.2 | 98.5±0.1 | 100.0±0.0 | 91.3±0.2 | 69.6±0.1 | 70.8±0.3 | 86.6 |
| CDAN [31] | 94.1±0.1 | 98.6±0.1 | 100.0±0.0 | 92.9±0.2 | 71.0±0.3 | 69.3±0.3 | 87.7 |
| BSP [5] | 93.3±0.2 | 98.2±0.2 | 100.0±0.0 | 93.0±0.2 | 73.6±0.3 | 72.6±0.3 | 88.5 |
| CAT [8] | 94.4±0.1 | 98.0±0.2 | 100.0±0.0 | 90.8±1.8 | 72.2±0.2 | 70.2±0.1 | 87.6 |
| SymNets [62] | 90.8±0.1 | 98.8±0.3 | 100.0±0.0 | 93.9±0.5 | 74.6±0.6 | 72.5±0.5 | 88.4 |
| ImA [20] | 90.3±0.2 | 98.7±0.1 | 99.8±0.0 | 92.1±0.5 | 75.3±0.2 | 74.9±0.3 | 88.8 |
| CCC-GAN [27] | 93.7±0.2 | 98.5±0.1 | 99.8±0.2 | 92.7±0.4 | 75.3±0.5 | 77.8±0.1 | 89.6 |
| E-MixNet [64] | 93.0±0.3 | 99.0±0.1 | 100.0±0.0 | 95.6±0.2 | 78.9±0.5 | 74.7±0.7 | 90.2 |
| MDD [61] | 94.5±0.3 | 98.4±0.1 | 100.0±0.0 | 93.5±0.2 | 74.6±0.3 | 72.2±0.1 | 88.9 |
| MDD+SAF | 96.7±0.3 | 99.3±0.2 | 100.0±0.0 | 94.4±0.2 | 77.2±0.1 | 75.7±0.3 | 90.5 |

Table 1. Accuracy (%) on Office-31 for UDA (ResNet-50).

| Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source only | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| DAN [30] | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
| DANN [10] | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| JAN [33] | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61.0 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3 |
| CDAN [31] | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
| BSP [5] | 52.0 | 68.6 | 76.1 | 58.0 | 70.3 | 70.2 | 58.6 | 50.2 | 77.6 | 72.2 | 59.3 | 81.9 | 66.3 |
| ImA [20] | 56.2 | 77.9 | 79.2 | 64.4 | 73.1 | 74.4 | 64.2 | 54.2 | 79.9 | 71.2 | 58.1 | 83.1 | 69.5 |
| E-MixNet [64] | 57.7 | 76.6 | 79.8 | 63.6 | 74.1 | 75.0 | 63.4 | 56.4 | 79.7 | 72.8 | 62.4 | 85.5 | 70.6 |
| MDD [61] | 54.9 | 73.7 | 77.8 | 60.0 | 71.4 | 71.8 | 61.2 | 53.6 | 78.1 | 72.5 | 60.2 | 82.3 | 68.1 |
| MDD+SAF | 56.6 | 79.8 | 82.7 | 68.2 | 76.6 | 77.5 | 65.4 | 55.8 | 82.5 | 74.0 | 62.3 | 84.8 | 72.2 |

Table 2. Accuracy (%) on Office-Home for UDA (ResNet-50).

Notice that the SAF module improves the performance of MDD on every benchmark task, even on tasks with poor source-only accuracies (e.g. the Cl→Ar, Pr→Ar, and Ar→Cl tasks in Office-Home). In addition, SAF surpasses all other MDD-based models (ImA, E-MixNet) on average accuracy.
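The per-task comparison for Office-Home can be checked with a few lines of arithmetic. The sketch below copies the MDD and MDD+SAF rows from Table 2 and computes the per-task gain and the averages; the variable names are only illustrative.

```python
TASKS = ["Ar→Cl", "Ar→Pr", "Ar→Rw", "Cl→Ar", "Cl→Pr", "Cl→Rw",
         "Pr→Ar", "Pr→Cl", "Pr→Rw", "Rw→Ar", "Rw→Cl", "Rw→Pr"]
# Rows copied from Table 2 (accuracy in %)
MDD = [54.9, 73.7, 77.8, 60.0, 71.4, 71.8, 61.2, 53.6, 78.1, 72.5, 60.2, 82.3]
MDD_SAF = [56.6, 79.8, 82.7, 68.2, 76.6, 77.5, 65.4, 55.8, 82.5, 74.0, 62.3, 84.8]


def mean(values):
    return sum(values) / len(values)


# Per-task improvement of MDD+SAF over MDD, in percentage points
GAINS = {task: round(b - a, 1) for task, a, b in zip(TASKS, MDD, MDD_SAF)}

if __name__ == "__main__":
    for task, gain in GAINS.items():
        print(task, gain)
    print("avg MDD:", round(mean(MDD), 1), "avg MDD+SAF:", round(mean(MDD_SAF), 1))
```

Every gain is positive, with the largest improvement on Cl→Ar, one of the tasks with low source-only accuracy, and the computed averages agree with the Avg column of Table 2.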

The t-SNE visualization [52] shown in Figure 3, where feature representations extracted from both domains of two Office-Home tasks are clustered via the t-SNE algorithm, strongly demonstrates the classification improvement brought by the SAF module.

5.3. Ablation Study and Analytics

Impact of the Bottleneck. As mentioned in Section 4.3, the bottleneck layer $B$ functions as a regularizer so that the output of SAF is better recognized by $C$. The necessity of $B$ is investigated through experiments on representative tasks of Office-Home. To keep the model capacity unchanged, $B$ needs to be preserved in place, so the input samples are forwarded in the order F→B→M→C. As shown in the “no bottleneck layer” entry of Table 4, the removal of the bottleneck weakens the model performance.

Impact of Adaptive SAF-mixup. In Section 4.5 we theoretically claim that SAF-mixup is a special mixup technique different from others. In this experiment, the SAF-mixup module is replaced by (1) a beta random variable Beta(0.2, 0.2) (used by [56, 59]), and (2) a constant 0.6 (used by [64]). Based on the comparison results, the modifications severely impair the accuracies, demonstrating the significance of the adaptive SAF module.

Impact of Two SAF Bottlenecks. To examine the necessity of using two separate SAF bottlenecks, a modified SAF model with a single SAF bottleneck and another version with four SAF bottlenecks are tested. The results show that either increasing or decreasing the number of bottlenecks harms the model performance, indicating that using two SAF bottlenecks is the optimal choice.

Impact of SAF on Source Data. As only target features are fed into SAF, it is natural to wonder whether forwarding the source features into SAF benefits the model. The “feed source into SAF” entry in Table 4 shows the result, and the answer is negative. Since the structure of SAF is relatively simple, its low model capacity prevents it from fitting both domains. When SAF learns from both source and target distributions, it loses its specialization on the target domain to compensate for the fitting on the source.

Impact of Sample Selection. According to active learning [6], all target samples, with certain or uncertain predictions, are involved in the SAF-mixup to assimilate the diversity of high-entropy samples and the stability of low-entropy samples. The SAF module can adaptively balance the biases of the source distribution and the noise brought by uncertain samples. To verify the impact of sample selection, we design experiments where mixup samples are filtered based on their conditional entropy with an empirical threshold. As shown in Table 4, mixing only uncertain samples or excluding them can harm the model performance.

| Method | Accuracy (%) |
|---|---|
| JAN [33] | 61.6 |
| GTA [44] | 69.5 |
| MCD [43] | 69.8 |
| CDAN [31] | 70.0 |
| ImA [20] | 75.8 |
| MDD [61] | 74.6 |
| MDD+SAF | 77.0 |

Table 3. VisDA2017 accuracy for UDA (ResNet-50).

| Modification | Ar→Pr | Pr→Ar |
|---|---|---|
| MDD [61] | 73.7 | 61.2 |
| no bottleneck layer | 75.7 | 61.6 |
| beta random variable η | 73.7 | 57.6 |
| constant η | 75.3 | 64.0 |
| only one SAF bottleneck | 76.0 | 62.3 |
| four SAF bottlenecks | 77.3 | 64.0 |
| feed source into SAF | 76.4 | 62.6 |
| only mix uncertain samples | 73.7 | 60.7 |
| only mix certain samples | 76.2 | 65.4 |
| MDD+SAF | 79.8 | 65.4 |

Table 4. The impact of different modifications to the SAF.

| Method | A→D | A→W | D→W | Avg |
|---|---|---|---|---|
| DANN [10] | 79.7 | 82.0 | 96.9 | 82.2 |
| DANN+SAF | 86.0 | 88.8 | 99.0 | 84.5 |
| CDAN [31] | 92.9 | 94.1 | 98.6 | 87.7 |
| CDAN+SAF | 94.6 | 95.1 | 99.3 | 89.7 |

Table 5. Accuracy (%) of SAF-improved adversarial methods (i.e. DANN [10] and CDAN [31]).

Figure 3. t-SNE visualizations of features extracted from Office-Home tasks. Panels: (a) MDD: Cl→Rw; (b) MDD+SAF: Cl→Rw; (c) MDD: Ar→Pr; (d) MDD+SAF: Ar→Pr. Best viewed in color. Red and blue dots represent source and target feature representations, respectively. The high-resolution version can be found in the supplementary material.

Generalizability of SAF. The effectiveness of SAF on MDD has been verified in standard experiments. In addition, since SAF can be plugged into arbitrary adversarial UDA models, the general effectiveness of SAF on all eligible backbones becomes a desirable property. Therefore, we further combine SAF with DANN [10] and CDAN [31] to test the generalizability. As shown in Table 5, the SAF plug-in strongly boosts the accuracies of the DANN and CDAN backbones on Office-31 tasks, which confirms its generalizability.

6. Conclusion

In this paper, a novel SAF algorithm is derived from cutting-edge theories to address the unsupervised domain adaptation problem. The key idea of SAF is to distill classification-related information from target features to augment the training signals. Our SAF framework learns from input targets and guides the classifier to find reasonable class boundaries for the target distribution. Extensive experiments demonstrate the state-of-the-art performance and decent generalizability of SAF.

References

[1] W. H. Beluch, T. Genewein, A. Nurnberger, and J. M. Kohler. The power of ensembles for active learning in image classification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9368–9377, June 2018.

[2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Mach. Learn., 79(1–2):151–175, May 2010.

[3] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.

[4] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.

[5] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1081–1090. PMLR, 09–15 Jun 2019.

[6] David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. J. Artif. Int. Res., 4(1):129–145, Mar. 1996.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[8] Zhijie Deng, Yucen Luo, and Jun Zhu. Cluster alignment with a teacher for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[9] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1183–1192, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[10] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.

[11] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, Jan. 2016.

[12] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073, June 2012.

[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

[14] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS’04, page 529–536, Cambridge, MA, USA, 2004. MIT Press.

[15] Bo Han, Quanming Yao, Tongliang Liu, Gang Niu, Ivor W. Tsang, James T. Kwok, and Masashi Sugiyama. A survey of label-noise representation learning: Past, present and future, 2021.

[16] Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep self-learning from noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.

[18] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.

[19] L. Hu, M. Kan, S. Shan, and X. Chen. Duplex generative adversarial network for unsupervised domain adaptation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1498–1507, June 2018.

[20] Xiang Jiang, Qicheng Lao, Stan Matwin, and Mohammad Havaei. Implicit class-conditioned domain alignment for unsupervised domain adaptation. In Hal Daume III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4816–4827. PMLR, 13–18 Jul 2020.

[21] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2372–2379, June 2009.

[22] V. Koltchinskii and D. Panchenko. Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers. The Annals of Statistics, 30(1):1–50, 2002.

[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.

[24] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec 1989.

[25] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[26] S. Lee, D. Kim, N. Kim, and S. Jeong. Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 91–100, Oct 2019.

[27] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[28] Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. Webvision database: Visual learning and understanding from web data, 2017.

[29] Jian Liang, Ran He, Zhenan Sun, and Tieniu Tan. Distant supervised centroid shift: A simple and efficient approach to visual domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[30] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, page 97–105. JMLR.org, 2015.

[31] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[32] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 136–144, Red Hook, NY, USA, 2016. Curran Associates Inc.

[33] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2208–2217, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[34] Xudong Mao, Yun Ma, Zhenguo Yang, Yangbin Chen, and Qing Li. Virtual mixup training for unsupervised domain adaptation. arXiv preprint arXiv:1905.04215, 2019.

[35] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 3934–3941. AAAI Press, 2018.

[36] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1406–1415, 2019.

[37] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge, 2017.

[38] Xiaojiang Peng, Kai Wang, Zhaoyang Zeng, Qing Li, Jianfei Yang, and Yu Qiao. Suppressing mislabeled data via grouping and self-attention. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 786–802, Cham, 2020. Springer International Publishing.

[39] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.

[40] D. E. Rumelhart and J. L. McClelland. Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, 1987.

[41] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV’10, page 213–226, Berlin, Heidelberg, 2010. Springer-Verlag.

[42] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2988–2997, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[43] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[44] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8503–8512, June 2018.

[45] Rui Shu, Hung H. Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

[46] Hwanjun Song, Minseok Kim, and Jae-Gil Lee. SELFIE: Refurbishing unclean samples for robust deep learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5907–5915. PMLR, 09–15 Jun 2019.

[47] Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey, 2020.

[48] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In ECCV 2016, Lecture Notes in Computer Science, volume 9915, 2016.

[49] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[50] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2962–2971, July 2017.

[51] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance, 2014.

[52] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008.

[53] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.

[54] Zhen Wang, Guosheng Hu, and Qinghua Hu. Training noise-robust deep neural networks via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[55] Zeya Wang, Baoyu Jing, Yang Ni, Nanqing Dong, Pengtao Xie, and Eric P. Xing. Adversarial domain adaptation being aware of class relationships, 2020.

[56] Yuan Wu, Diana Inkpen, and Ahmed El-Roby. Dual mixup regularized learning for adversarial domain adaptation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 540–555, Cham, 2020. Springer International Publishing.

[57] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, 2015.

[58] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision (ICCV), October 2019.

[59] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

[60] Weihe Zhang, Yali Wang, and Yu Qiao. Metacleaner: Learning to hallucinate clean representations for noisy-labeled visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[61] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7404–7413. PMLR, 09–15 Jun 2019.

[62] Yabin Zhang, Hui Tang, Kui Jia, and Mingkui Tan. Domain-symmetric networks for adversarial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5031–5040, 2019.

[63] Z. Zhang, H. Zhang, S. O. Arik, H. Lee, and T. Pfister. Distilling effective supervision from severe label noise. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9291–9300, June 2020.

[64] Li Zhong, Zhen Fang, Feng Liu, Jie Lu, Bo Yuan, and Guangquan Zhang. How does the combined risk affect the performance of unsupervised domain adaptation approaches? CoRR, abs/2101.01104, 2021.

Appendix

6.1. Theoretical Insight: Evolution of UDA Algorithms

In this section, we introduce the evolution of domain adaptation theories, along with the UDA algorithms built on these theoretical backgrounds over time. In summary, we divide the theoretical progression of domain adaptation into two stages. In each stage, researchers discover a specific issue in UDA, initially tackle it with traditional statistical methods (explicit techniques), and end up solving the problem with implicit approaches.

6.1.1 Notations

For a neural network designed for UDA tasks, denote $F$, $C$, $D$ as the feature extractor, classifier, and adversarial module, respectively. The feature extractor $F$ is a structure that distills feature representations from input images. Usually, researchers use pretrained backbones (e.g. AlexNet [23], ResNet [17]) for the extractor. The classifier $C$ is a Multi-Layer Perceptron (MLP) that assigns class labels to input features. The adversarial module $D$ is another MLP that is trained to regularize $F$ and $C$ through mini-max games. The adversarial module was first proposed by [10] as a domain classifier. $F$, $C$, $D$ are commonly present in recent UDA frameworks, as the adversarial design has become pervasive among UDA algorithms.

We follow the definition of domain proposed by [2] and generalized by [20]. Given an input space $\mathcal{X}$ and a label space $\mathcal{Y}$, a domain $\mathbb{D}$ is a pair $\langle \mathcal{D}, f \rangle$ consisting of a distribution $\mathcal{D}$ on $\mathcal{X}$ and a labeling function $f : \mathcal{X} \to \mathcal{Y}$. Namely, the labeling function assigns the ground-truth label $y \in \mathcal{Y}$ to each sample $x \in \mathcal{X}$.

Let $\mathbb{D}_S = \langle \mathcal{D}_S, f_S \rangle$ denote the source domain on the input space, with domain label $Z_S = 0$. Let $(\mathcal{S}, \mathcal{Y}_S)$ be the set of available labeled source samples, with empirical distribution $\widehat{\mathcal{D}}_S$. Let $\mathcal{F}_S$ be the set of feature representations extracted from the source dataset, with empirical feature distribution $\Phi_S$. Moreover, denote $\mathcal{P}_S$ as the set of predicted probability vectors produced by the classifier from $\mathcal{F}_S$.

The symmetric definitions on the target domain, $\mathbb{D}_T = \langle \mathcal{D}_T, f_T \rangle$, $Z_T = 1$, $(\mathcal{T}, \mathcal{Y}_T)$, $\widehat{\mathcal{D}}_T$, $\mathcal{F}_T$, $\Phi_T$, $\mathcal{P}_T$, are described in exactly the same manner as their source counterparts.

Let $\mathcal{H}$ be the hypothesis class of classifiers that map from $\mathcal{X}$ to $\{Z_S, Z_T\}$.

6.1.2 Settings

In Unsupervised Domain Adaptation (UDA), there are two datasets: the labeled source dataset $(\mathcal{S}, \mathcal{Y}_S)$ drawn from $\mathbb{D}_S$, and the unlabeled target dataset $(\mathcal{T}, \emptyset)$ drawn from $\mathbb{D}_T$, sharing the input space $\mathcal{X}$ and the label space $\mathcal{Y}$. The crux of the UDA setting is that the discrepancy between the two domains cannot be explicitly alleviated, so methods capable of mining underlying statistical patterns, rather than merely learning from labels, are required.

6.1.3 Foundations of Domain Adaptation

We begin with the fundamental theories and basic terminology of domain adaptation. For the given domains $\mathbb{D}_S$ and $\mathbb{D}_T$, we use the $\mathcal{H}$-divergence [2] to measure their discrepancy with respect to the hypothesis class $\mathcal{H}$:

$$d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) := 2 \sup_{h \in \mathcal{H}} \Big| \mathbb{P}_{\mathcal{D}_S}[h = 0] - \mathbb{P}_{\mathcal{D}_T}[h = 0] \Big|, \tag{12}$$

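which can be empirically estimated if $\mathcal{H}$ is symmetric:

$$d_{\mathcal{H}}(\mathcal{S}, \mathcal{T}) \triangleq 2 \Big( 1 - \min_{h \in \mathcal{H}} \Big[ \frac{1}{|\mathcal{S}|} \sum_{x \in h_0} \mathbb{1}[x \in \mathcal{S}] + \frac{1}{|\mathcal{T}|} \sum_{x \in h_1} \mathbb{1}[x \in \mathcal{T}] \Big] \Big), \tag{13}$$

where $h_Z := \{x \in (\mathcal{S} \cup \mathcal{T}) \mid h(x) = Z\}$ for $Z \in \{Z_S := 0, Z_T := 1\}$, and $\mathbb{1}$ is the binary indicator function.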
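The empirical estimate in Equation 13 can be illustrated with a small sketch. It assumes a finite, symmetric hypothesis set of one-dimensional threshold classifiers; the function names, the thresholds, and the toy data are illustrative assumptions and do not come from the paper.

```python
def threshold_hypotheses(thresholds):
    """Build a symmetric set of 1-D domain classifiers h: x -> {0, 1}."""
    hypotheses = []
    for t in thresholds:
        hypotheses.append(lambda x, t=t: 1 if x > t else 0)
        # The complement of each classifier keeps the set symmetric.
        hypotheses.append(lambda x, t=t: 0 if x > t else 1)
    return hypotheses

def empirical_h_divergence(source, target, hypotheses):
    """Equation 13, with the minimum taken over a finite hypothesis list."""
    smallest = float("inf")
    for h in hypotheses:
        # Fraction of source samples in h_0 plus fraction of target samples in h_1.
        source_term = sum(1 for x in source if h(x) == 0) / len(source)
        target_term = sum(1 for x in target if h(x) == 1) / len(target)
        smallest = min(smallest, source_term + target_term)
    return 2.0 * (1.0 - smallest)

hypotheses = threshold_hypotheses([0.0, 0.25, 0.5, 0.75, 1.0])
separated = empirical_h_divergence([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9], hypotheses)
identical = empirical_h_divergence([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4], hypotheses)
```

For perfectly separable toy domains the estimate reaches its maximum of 2, while for identical samples every hypothesis contributes a sum of exactly 1 and the estimate is 0.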

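Intuitively, the empirical $\mathcal{H}$-divergence measures the discrepancy between two domains through the worst domain classifier in $\mathcal{H}$, i.e. the minimizer $h^{-} \in \mathcal{H}$ in Equation 13. If $h^{-}$ assigns the correct domain label to almost no sample (so that, by the symmetry of $\mathcal{H}$, its complement separates the two domains almost perfectly), then the discrepancy between $\mathcal{S}$ and $\mathcal{T}$ is high on $\mathcal{H}$, and vice versa.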

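Based on the above notations, the $\mathcal{H}\Delta\mathcal{H}$-divergence [2] is defined as follows:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) := 2 \sup_{h, h' \in \mathcal{H}} \Big| \mathbb{P}_{\mathcal{D}_S}[h \neq h'] - \mathbb{P}_{\mathcal{D}_T}[h \neq h'] \Big|. \tag{14}$$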

Following this definition, a powerful bound on the classification error on $\mathcal{D}_T$ can be derived. For every $h \in \mathcal{H}$ on the target domain $\mathbb{D}_T$, with probability at least $1 - \delta$ ($\delta > 0$):

$$\varepsilon_{\mathcal{D}_T}(h) \leq \varepsilon_{\mathcal{D}_S}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda, \tag{15}$$

where $\lambda$ is a constant.

Hypothesis Neighborhood. The concept of the hypothesis class $\mathcal{H}$ becomes confusing in the above formulas. For a fixed CNN structure, the hypothesis space is unchangeable, but shrinking the discrepancy term $d_{\mathcal{H}\Delta\mathcal{H}}$ requires shifting $\mathcal{H}$. In fact, we can regard the process of optimizing CNNs, guided by various supervisory signals (loss criteria, regularizations), as shrinking the hypothesis space. Led by supervision, the CNN is guided to find the optimized neighborhood $\mathcal{H}^*$ in the universal hypothesis class $\mathcal{H}$, where the local $\mathcal{H}^*\Delta\mathcal{H}^*$-divergence is minimized across the universe:

$$\mathcal{H}^* := \arg\min_{\mathcal{H}' \subset \mathcal{H}} \Big[ d_{\mathcal{H}'\Delta\mathcal{H}'}(\mathcal{D}_S, \mathcal{D}_T) \Big]. \tag{16}$$

In the rest of this document, the symbol $\mathcal{H}$ indicates the hypothesis neighborhood of the current classifier $h$. When designing domain adaptation algorithms, we are not interested in the universal hypothesis class, but try to discover the best optimization strategies that drive the models to reach $\mathcal{H}^*$.

6.1.4 Stage I: Domain Feature Alignment

After the popularization of CNNs, researchers combine traditional statistical techniques with deep CNNs [29, 30, 32, 45, 48, 51, 58] to enhance model performance in domain adaptation tasks. Generally, these works employ explicit methods to measure the discrepancy between the feature representations $\mathcal{F}_S$ and $\mathcal{F}_T$ (the output of the feature extractor $F$), or between the output probabilities $\mathcal{P}_S$ and $\mathcal{P}_T$ (the output of the classifier $C$). A common practice is to design a loss term based on the calculated discrepancy and to encourage the network to align all samples. Following Equation 15, researchers believe that the discrepancy between $\Phi_S$ and $\Phi_T$ (the domain discrepancy among features) greatly affects the classification accuracy. Hence, methods in this stage strive to close the gap from two directions: (1) regularize the model to ignore domain-variant features; (2) push the model to escape from the suboptimal neighborhoods $\mathcal{H}^*$. To sum up, in this stage, researchers try to distill the essence of explicit methods for better feature alignment across different domains.

DANN [10] is a revolutionary innovation that changes the situation. In DANN, domain-variant feature representations are no longer aligned with the assistance of explicit calculations; instead, the alignment is achieved implicitly by training another MLP, the domain discriminator $D$. DANN was inspired by GAN [13], where two independent deep neural networks, the Generator and the Discriminator, are trained together but with completely opposite goals. The Generator aims to create fake samples that mimic real samples from random Gaussian noise to fool the Discriminator, while the Discriminator strives to distinguish between real samples from the dataset and fake samples generated by the Generator. Abiding by the idea of the Discriminator, the goal of the domain discriminator $D$ is to recognize whether the features extracted by $F$ are from the source distribution or its target counterpart. Similar to a Generator, the feature map $F$, while extracting better feature representations to aid $C$, also needs to extract domain-invariant features shared by $\Phi_S$ and $\Phi_T$, incapacitating $D$ from making correct predictions.

(a) (b)

Figure 4. Illustration of discriminability. Best viewed in color. (a) shows a feature distribution with decent discriminability, as the classification boundary can be closely estimated by the classifier. (b) shows the contrary, where the dashed line is the ground-truth class border, but a classifier with regular capacity usually finds the solid curve as the predicted border.

The invention of DANN signals a major change in the development of UDA algorithms: the alignment of domain features can finally be done in an implicit, neural-network-styled way. Moreover, this implicit approach outperforms existing explicit methods on popular UDA benchmarks [11].

6.1.5 Stage II: Classification Feature Alignment

As DANN emerges, adversarial approaches become favorable among researchers, which inspires numerous adversarial UDA methods [26, 31, 33, 35, 42, 43, 50]. However, another cloud still obscures the sky of UDA research: even though domain alignment regularizations are employed, none of the existing models can achieve the same performance on the target domain as on the source domain.

Shortcomings of DANN. After a series of formal analyses and rigorous experiments, researchers introduce two concepts that are necessary for understanding the dilemma of DANN: transferability and discriminability [5]. Transferability is an attribute indicating whether the feature representations extracted by $F$ are completely shared by both domains, i.e. $\Phi_S = \Phi_T$. Discriminability is another property referring to how easily a classifier can find clear class boundaries among the input feature representations, as shown in Figure 4.

Formally, the DANN structure, while boosting the accuracy on UDA tasks by enhancing transferability, actually misleads $F$ into extracting feature representations with decreased discriminability [5]. To address this issue, multiple approaches [5, 43, 55] attempt to explicitly increase the discriminability of the features extracted by $F$ in adversarial structures. In this stage, researchers aim to design objective functions able to lead $F$ to excavate more class-variant, or discriminative, features.

We can also understand this phenomenon by investigating the theoretical support of DANN. Since the $\mathcal{H}$-divergence (Equation 12) and the $\mathcal{H}\Delta\mathcal{H}$-divergence (Equation 14) employ the discrete 0-1 disparity as the risk criterion:

$$\varepsilon_{\mathcal{D}}(h) := \mathbb{P}_{\mathcal{D}}\big[ \mathbb{1}[h \neq f] \big], \tag{17}$$

one can hardly generalize a reasonable scoring function for classifiers with more than two classes from them. A theoretically possible but technically impractical approach is to naively create a binary classifier for every pair of classes, but this design is computationally expensive, especially for datasets with a large number of classes (e.g. ImageNet [7], DomainNet [36]). Hence, the binary classifier $D$ can only function as a domain regularizer and cannot boost the classification accuracy.

MDD [61]. Some researchers regard such incompatibility as a gap between theories and algorithms in the field of UDA, making the optimization process extremely difficult or even impossible. To fill this gap, the concept of Margin Disparity Discrepancy (MDD) [61] is formulated, which is another revolutionary theoretical innovation in UDA research.

The MDD framework is able to implicitly align classification features between two domains with an adversarial approach. MDD not only bridges the gap between existing discrete domain adaptation theories and the continuous objective functions required by model training, but also makes the alignment of classification features implicit. In other words, MDD is the first to employ the implicit approach in guiding feature extractors to distill feature representations with better discriminability. As a result, MDD outperforms existing methods on benchmarks [61] and again demonstrates the superiority of implicit approaches over explicit ones.

6.2. Definitions and Notations

We redefine core DA concepts to adapt to the MDD system. Let $\mathcal{C}$ be the hypothesis space of classifiers that map from the feature space $\Phi$ to $[0, 1]^{|\mathcal{Y}|}$. For arbitrary $C \in \mathcal{C}$, $\phi \in \Phi$, and $y \in \mathcal{Y}$, denote $C_y(\phi)$ as the predicted probability of $\phi$ belonging to the class $y$, and denote $C(\phi)$ as the predicted label for $\phi$. In addition, $\mathcal{X}$ and $\Phi$ are interchangeable in this context, as we focus on the optimization of classifiers.

Margin. Following the margin theory [22], the margin [61] of a classifier $C$ with respect to another classifier $C'$ on a sample $x \in \mathcal{X}$ is defined as:

$$\rho_C(x, C') := \frac{1}{2} \Big( C_{C'(x)}(x) - \max_{y \neq C'(x)} C_y(x) \Big). \tag{18}$$

Namely, the margin regards the label predicted by $C'$ as the exemplar label, and measures the distance between the probability of this “ground-truth label” and the largest probability among the remaining labels.

Margin Loss. Given a task-dependent constant margin threshold (denoted as $\varrho$), we would like the margin criterion to satisfy the following properties:

• For any input, the margin loss falls in $[0, 1]$.
• The loss becomes 1 if $\rho_C(x, C')$ is negative.
• The loss becomes 0 if $\rho_C(x, C')$ is larger than $\varrho$.
• The loss decreases as $\rho_C(x, C')$ increases in $[0, \varrho]$.

Therefore, the fundamental margin loss [61] of a classifier $C \in \mathcal{C}$ w.r.t. another classifier $C' \in \mathcal{C}$ on a sample $x \in \mathcal{X}$ is defined as:

$$\rho^{(\varrho)}_C(x, C') := \begin{cases} 1 & \text{if } \rho_C(x, C') \in (-\infty, 0) \\ 1 - \rho_C(x, C') / \varrho & \text{if } \rho_C(x, C') \in [0, \varrho] \\ 0 & \text{if } \rho_C(x, C') \in (\varrho, \infty) \end{cases}. \tag{19}$$
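The margin of Equation 18 and the margin loss of Equation 19 can be written down directly. The sketch below is a minimal illustration on a single probability vector; the function names and the example values are assumptions made for this example only.

```python
def margin(probs, reference_label):
    """Equation 18: half the gap between the reference class and the best other class."""
    best_other = max(p for y, p in enumerate(probs) if y != reference_label)
    return 0.5 * (probs[reference_label] - best_other)

def margin_loss(rho, threshold):
    """Equation 19: piecewise-linear loss of a margin value for a given threshold."""
    if rho < 0.0:
        return 1.0
    if rho <= threshold:
        return 1.0 - rho / threshold
    return 0.0

# Hypothetical prediction of C on one sample; the reference label comes from C'.
probs = [0.7, 0.2, 0.1]
agree = margin_loss(margin(probs, reference_label=0), threshold=0.5)
disagree = margin_loss(margin(probs, reference_label=1), threshold=0.5)
```

When the reference label agrees with the largest probability, the margin is positive and the loss falls below 1; when it disagrees, the margin is negative and the loss saturates at 1, matching the four properties listed above.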

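Margin Prediction Risk. For a fixed threshold $\varrho$, define the margin prediction risk for a classifier $C$ on a domain $\mathbb{D} = \langle \mathcal{D}, f \rangle$ as the following:

$$\varepsilon^{(\varrho)}_{\mathcal{D}}(C) := \mathbb{E}_{x \sim \mathcal{D}} \Big[ \rho^{(\varrho)}_C(x, f) \Big]. \tag{20}$$

The empirical margin prediction risk on a dataset $\mathcal{U}$ with ground-truth labeling function $f_{\mathcal{U}}$ can be calculated by:

$$\varepsilon^{(\varrho)}_{\mathcal{U}}(C) \triangleq \frac{1}{|\mathcal{U}|} \sum_{x \in \mathcal{U}} \rho^{(\varrho)}_C(x, f_{\mathcal{U}}). \tag{21}$$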

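Margin Disparity. The margin disparity [61] (denoted as $\Delta$) on a domain $\langle \mathcal{D}, f \rangle$ w.r.t. two classifiers $C, C' \in \mathcal{C}$ is defined as the following:

$$\Delta^{(\varrho)}_{\mathcal{D}}(C, C') := \mathbb{E}_{x \sim \mathcal{D}} \Big[ \rho^{(\varrho)}_C(x, C') \Big]. \tag{22}$$

And its empirical form on a dataset $\mathcal{U}$ is:

$$\Delta^{(\varrho)}_{\mathcal{U}}(C, C') \triangleq \frac{1}{|\mathcal{U}|} \sum_{x \in \mathcal{U}} \rho^{(\varrho)}_C(x, C'). \tag{23}$$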

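Margin Disparity Discrepancy. Fixing the threshold $\varrho$, the margin disparity discrepancy [61] (denoted as $d^{(\varrho)}$) for a classifier $C$ in the hypothesis space $\mathcal{C}$ w.r.t. domain distributions $\mathcal{D}_S$, $\mathcal{D}_T$ is defined as:

$$d^{(\varrho)}_{C,\mathcal{C}}(\mathcal{D}_S, \mathcal{D}_T) := 2 \sup_{C' \in \mathcal{C}} \Big( \Delta^{(\varrho)}_{\mathcal{D}_S}(C, C') - \Delta^{(\varrho)}_{\mathcal{D}_T}(C, C') \Big). \tag{24}$$

The empirical MDD for $C$ w.r.t. the source and target datasets $\mathcal{S}$, $\mathcal{T}$ is estimated as:

$$d^{(\varrho)}_{C,\mathcal{C}}(\mathcal{S}, \mathcal{T}) \triangleq 2 \max_{C' \in \mathcal{C}} \Big( \Delta^{(\varrho)}_{\mathcal{S}}(C, C') - \Delta^{(\varrho)}_{\mathcal{T}}(C, C') \Big). \tag{25}$$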
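To make Equations 23 and 25 concrete, the following sketch evaluates the empirical margin disparity and the empirical MDD when the maximum is restricted to a small, finite list of candidate classifiers. The toy one-dimensional classifiers, the data, and the threshold are assumptions for illustration; in practice the maximization over $C'$ is carried out by adversarial training rather than by enumeration.

```python
import math

def predicted_label(probs):
    """Index of the largest predicted probability."""
    return max(range(len(probs)), key=lambda y: probs[y])

def margin_loss(probs, reference_label, threshold):
    """Equations 18 and 19 for one probability vector and one reference label."""
    best_other = max(p for y, p in enumerate(probs) if y != reference_label)
    rho = 0.5 * (probs[reference_label] - best_other)
    if rho < 0.0:
        return 1.0
    return max(0.0, 1.0 - rho / threshold)

def margin_disparity(clf, ref_clf, data, threshold):
    """Equation 23: mean margin loss of clf against the labels predicted by ref_clf."""
    losses = [margin_loss(clf(x), predicted_label(ref_clf(x)), threshold) for x in data]
    return sum(losses) / len(losses)

def empirical_mdd(clf, candidates, source, target, threshold):
    """Equation 25, with the maximum taken over a finite candidate list."""
    return 2.0 * max(
        margin_disparity(clf, c, source, threshold)
        - margin_disparity(clf, c, target, threshold)
        for c in candidates
    )

def make_classifier(shift):
    """Hypothetical two-class classifier on scalar inputs."""
    def classify(x):
        p1 = 1.0 / (1.0 + math.exp(-(x - shift)))
        return [1.0 - p1, p1]
    return classify

clf = make_classifier(0.0)
candidates = [make_classifier(s) for s in (-1.0, 0.0, 1.0)]
source = [-2.0, -1.0, 1.0, 2.0]
target = [-0.5, 0.0, 0.5, 3.0]
mdd = empirical_mdd(clf, candidates, source, target, threshold=0.5)
same = empirical_mdd(clf, candidates, source, source, threshold=0.5)
```

Because each margin disparity lies in $[0, 1]$, the estimate is bounded by 2 in absolute value, and it vanishes when the source and target samples coincide.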

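Error Bound w.r.t. MDD. With the equations above, we are able to derive an error bound for $C$ on $\mathcal{D}_T$, with probability at least $1 - 3\delta$ ($\delta > 0$):

$$\begin{aligned} \varepsilon_{\mathcal{D}_T}(C) \leq{} & \varepsilon^{(\varrho)}_{\mathcal{S}}(C) + d^{(\varrho)}_{C,\mathcal{C}}(\mathcal{S}, \mathcal{T}) + \lambda(\varrho, \mathcal{C}, \mathcal{D}_S, \mathcal{D}_T) \\ & + 2 \sqrt{\frac{\log \frac{2}{\delta}}{2|\mathcal{S}|}} + \sqrt{\frac{\log \frac{2}{\delta}}{2|\mathcal{T}|}} \\ & + K\Big( |\mathcal{Y}|, \frac{1}{\varrho}, \frac{1}{\sqrt{|\mathcal{S}|}}, \frac{1}{\sqrt{|\mathcal{T}|}} \Big), \end{aligned} \tag{26}$$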

where $\lambda$ is a constant, while $K$ is a term positively related to the number of classes $|\mathcal{Y}|$ and negatively related to the margin threshold $\varrho$ and the sizes of both the source and target datasets.

Hypothesis Neighborhood (MDD). As mentioned in Section 6.1.3, the concept of an unchangeable hypothesis class $\mathcal{C}$ is questionable for the optimization process. In the scenario of MDD, the optimal hypothesis neighborhood $\mathcal{C}^*$ is a region where the local discrepancy $d^{(\varrho)}_{C,\mathcal{C}^*}(\mathcal{D}_S, \mathcal{D}_T)$ is the minimum across the universe:

$$\mathcal{C}^* := \arg\min_{\mathcal{C}' \subset \mathcal{C}} \Big[ \max_{C \in \mathcal{C}'} d^{(\varrho)}_{C,\mathcal{C}'}(\mathcal{D}_S, \mathcal{D}_T) \Big]. \tag{27}$$

Similarly, we use $\mathcal{C}$ to denote the hypothesis neighborhood of the current classifier $C$.

6.3. Cross-Entropy Divergence

The cross-entropy loss, $L_{CEL}$, is a pervasive loss function. For a set of logit vectors $P$ with ground-truth labels $Q$:

$$L_{CEL}(P, Q) = \mathbb{E}_{(X, y) \in (P, Q)} \big[ -\log[\sigma_y(X)] \big], \tag{28}$$

where $\sigma_y$ is the softmax probability for label $y$. However, the cross-entropy loss cannot measure the divergence between two discrete distribution vectors.

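Algorithm 2: SAF-mixup
Module: bottleneck $B$, classifier $C$, SAF bottlenecks $S_1$, $S_2$, SAF weight estimator $S_\eta$
Input: the set of target features $\mathcal{F}_T$
Output: augmented features $(\tilde{\mathcal{F}}_T, \tilde{\mathcal{Y}}_T)$
BEGIN:
    # initialize output sets
    $\tilde{\mathcal{F}}_T \leftarrow \emptyset$; $\tilde{\mathcal{Y}}_T \leftarrow \emptyset$
    while $\mathcal{F}_T \neq \emptyset$ do
        # draw from $\mathcal{F}_T$ without replacement
        $\phi_1, \phi_2 \leftarrow$ RandomDrawPair($\mathcal{F}_T$)
        # feed $\phi_1, \phi_2$ into $S_1, S_2$, respectively;
        # then feed the sum into $S_\eta$ to get the weight
        $\eta \leftarrow S_\eta\big(S_1(\phi_1) + S_2(\phi_2)\big)$
        # linearly combine $\phi_1, \phi_2$ w.r.t. $\eta$
        $\phi \leftarrow \eta \phi_1 + (1 - \eta) \phi_2$
        # get pseudo-labels for the pair
        $y_1 \leftarrow C \circ B(\phi_1)$; $y_2 \leftarrow C \circ B(\phi_2)$
        # similar for $y_1, y_2$
        $y \leftarrow \eta y_1 + (1 - \eta) y_2$
        # update $\tilde{\mathcal{F}}_T$ and $\tilde{\mathcal{Y}}_T$
        $\tilde{\mathcal{F}}_T \leftarrow \tilde{\mathcal{F}}_T \cup \{\phi\}$; $\tilde{\mathcal{Y}}_T \leftarrow \tilde{\mathcal{Y}}_T \cup \{y\}$
    end while
END.

Moreover, although other statistical metrics (e.g. KL-divergence, JS-divergence) provide decent discrepancy measurements, their quantitative outputs are usually too small to balance the source supervision losses calculated via the cross-entropy criterion, which is unfavorable for training CNNs.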

To address this disadvantage, we generalize the cross-entropy loss to composite labels by defining the cross-entropy divergence, $L_{CED}$:

$$L_{CED}(P, Q) = \mathbb{E}_{(X, Y) \in (P, Q)} \big[ -Y^{\top} \log[\sigma(X)] \big], \tag{29}$$

where $P$ is the set of predicted logits, and $Q$ is the set of composite labels.
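A minimal sketch of the cross-entropy divergence in Equation 29 is given below, written with plain Python lists so that it stays self-contained. The function names and the toy inputs are assumptions for illustration; with one-hot labels the value reduces to the ordinary cross-entropy loss of Equation 28.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def cross_entropy_divergence(logit_batch, soft_label_batch):
    """Equation 29: mean of -Y^T log(softmax(X)) over a batch of (logits, soft label) pairs."""
    total = 0.0
    for logits, soft_label in zip(logit_batch, soft_label_batch):
        log_probs = log_softmax(logits)
        total += -sum(y * lp for y, lp in zip(soft_label, log_probs))
    return total / len(logit_batch)

# Uniform logits: any label distribution that sums to one gives log(2).
hard = cross_entropy_divergence([[0.0, 0.0]], [[1.0, 0.0]])
soft = cross_entropy_divergence([[0.0, 0.0]], [[0.3, 0.7]])
```

Unlike the KL-divergence, this quantity does not subtract the entropy of the composite label, so its scale stays comparable to the supervised cross-entropy loss on the source data.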

6.4. Algorithm

The commented SAF-mixup algorithm is displayedin Algorithm 2.
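The steps of Algorithm 2 can be mirrored in a short sketch. This is an illustration under stated assumptions, not the authors' implementation: the modules $S_1$, $S_2$, $S_\eta$ and the composition $C \circ B$ are passed in as plain callables, the toy stand-ins below are hypothetical, and an unpaired leftover feature is simply dropped because the algorithm does not specify that case.

```python
import math
import random

def mix(a, b, eta):
    """Convex combination eta * a + (1 - eta) * b of two equal-length vectors."""
    return [eta * u + (1.0 - eta) * v for u, v in zip(a, b)]

def saf_mixup(target_features, s1, s2, s_eta, classify, rng):
    """Pair target features at random and mix each pair and its pseudo-labels."""
    pool = list(target_features)
    rng.shuffle(pool)  # random pairing, drawn without replacement
    mixed_features, mixed_labels = [], []
    while len(pool) >= 2:
        phi1, phi2 = pool.pop(), pool.pop()
        # Sum of the two bottleneck outputs goes to the weight estimator.
        eta = s_eta([a + b for a, b in zip(s1(phi1), s2(phi2))])
        mixed_features.append(mix(phi1, phi2, eta))
        # Pseudo-labels from the classifier are mixed with the same weight.
        mixed_labels.append(mix(classify(phi1), classify(phi2), eta))
    return mixed_features, mixed_labels

# Hypothetical stand-ins for the learned modules.
def identity(v):
    return list(v)

def toy_weight(v):
    return 1.0 / (1.0 + math.exp(-sum(v)))  # sigmoid keeps eta in (0, 1)

def toy_classifier(v):
    exps = [math.exp(x) for x in v]
    return [e / sum(exps) for e in exps]

features = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8], [0.0, 0.0]]
mixed_x, mixed_y = saf_mixup(features, identity, identity, toy_weight,
                             toy_classifier, random.Random(0))
```

Since the weight lies in $(0, 1)$ and both pseudo-labels are probability vectors, every mixed label is again a probability vector, which is what the cross-entropy divergence expects as a composite label.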

The training scheme for the complete SAF framework is shown in Algorithm 3.
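The two update rules at the end of Algorithm 3 can be illustrated with a toy sketch in which parameters and gradients are plain lists of numbers. The helper names `combined_gradient` and `step` and all numeric values are hypothetical; in practice the gradients would come from automatic differentiation rather than being supplied by hand.

```python
def combined_gradient(grad_c, grad_d, grad_m, lambda_d, lambda_m):
    """Gradient of eps_C + lambda_D * eps_D + lambda_M * eps_M, term by term."""
    return [
        gc + lambda_d * gd + lambda_m * gm
        for gc, gd, gm in zip(grad_c, grad_d, grad_m)
    ]


def step(params, grads, lr, ascend=False):
    """One gradient step: descent by default, ascent for the adversary."""
    sign = 1.0 if ascend else -1.0
    return [p + sign * lr * g for p, g in zip(params, grads)]


# Hypothetical numbers, for illustration only.
lr = 0.1
main_params = [1.0, -2.0]  # stands in for the parameters of F, M, B, C
adv_params = [0.3]         # stands in for the parameters of D

grad = combined_gradient(
    grad_c=[0.5, 0.0], grad_d=[1.0, -1.0], grad_m=[0.0, 2.0],
    lambda_d=0.1, lambda_m=0.1,
)
main_params = step(main_params, grad, lr)               # descend on the weighted sum
adv_params = step(adv_params, [2.0], lr, ascend=True)   # ascend on eps_D
```

The opposite signs reflect the adversarial setup: F, M, B, and C minimize the weighted sum of the three objectives, while D is updated to increase the adversarial term.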

6.5. Implementation Details

6.5.1 Network Structures

The network structure designed for experiments is built upon the source code of MDD [61] (https://github.com/thuml/MDD/) and of ImA [20] (https://github.com/xiangdal/implicit_alignment/). The source code can be found in our supplementary materials.

The feature extractor F is a ResNet50 [17] pre-trained on ImageNet [7], following the commonly used UDA training protocol [10, 31].

The bottleneck B uses the following structure to filter input features:

• Fully-Connected Layer (2048→1024)
• Batch Normalization Layer
• ReLU Layer
• Dropout Layer (50%)

The classifier C and the adversarial module D share the same architecture in the MDD structure, where D is able to regularize F for extraction of transferable as well as discriminative features:

• Fully-Connected Layer (1024→1024)
• ReLU Layer
• Dropout Layer (50%)
• Fully-Connected Layer (1024→|Y|)

where |Y| is the number of classes.

The SAF bottlenecks S1, S2 use a relatively simple structure to process feature representations from F:

• Fully-Connected Layer (2048→384)
• ReLU Layer

while the SAF weight estimator consists of:

• Fully-Connected Layer (384→1)
• Sigmoid Layer

which yields a weight η ∈ (0, 1).
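To give a sense of the relative sizes of these modules, the following sketch counts their learnable parameters from the layer shapes listed above. It assumes that every fully-connected layer has a bias term and that batch normalization has a learnable scale and shift; neither is stated in the text. The helper names are hypothetical, and 65 classes (the size of Office-Home) is used only as an example value for |Y|.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer (bias assumed)."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of one batch-norm layer (affine assumed)."""
    return 2 * n_features


def module_param_counts(num_classes):
    """Learnable parameter counts for the modules described in this section."""
    return {
        "bottleneck_B": linear_params(2048, 1024) + batchnorm_params(1024),
        "classifier_C": linear_params(1024, 1024) + linear_params(1024, num_classes),
        "saf_bottleneck_S": linear_params(2048, 384),  # one of S1, S2
        "saf_weight_estimator": linear_params(384, 1),
    }


counts = module_param_counts(num_classes=65)  # example: Office-Home
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under these assumptions, each SAF bottleneck is well under half the size of the bottleneck B, and the weight estimator adds only a few hundred parameters.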

6.5.2 Hyperparameters

The GRL [10] weight $\lambda_D$ is initially set to 0 and gradually increases to 0.1, following the schedule:

$$\lambda_D(t) = 0.1 \tanh\frac{10t}{T}, \tag{30}$$

where t is the current iteration number and $T = 10^5$ is the total number of training iterations.

Similarly, the SAF mixup weight $\lambda_M$ increases from 0 to 0.1 at a slower pace:

$$\lambda_M(t) = 0.1 \tanh\frac{5t}{T}. \tag{31}$$
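The two schedules in Eqs. (30) and (31) are straightforward to evaluate. The short sketch below does so in plain Python; the function names `grl_weight` and `mixup_weight` and the constant `TOTAL_ITERS` are hypothetical, and the total of 10^5 iterations follows the value of T given in the text.

```python
import math

TOTAL_ITERS = 10 ** 5  # T, the total number of training iterations


def grl_weight(t, total=TOTAL_ITERS):
    """GRL weight lambda_D(t) = 0.1 * tanh(10 t / T), Eq. (30)."""
    return 0.1 * math.tanh(10.0 * t / total)


def mixup_weight(t, total=TOTAL_ITERS):
    """SAF mixup weight lambda_M(t) = 0.1 * tanh(5 t / T), Eq. (31)."""
    return 0.1 * math.tanh(5.0 * t / total)


# Print both weights at a few checkpoints of training.
for t in (0, 10_000, 50_000, 100_000):
    print(t, round(grl_weight(t), 4), round(mixup_weight(t), 4))
```

Both weights start at 0 and saturate near 0.1, with the mixup weight lagging behind the GRL weight throughout training, so the mixed target supervision is phased in only after adversarial alignment has begun to take effect.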

6.6. Visualizations

Two sets of t-SNE visualizations [52] are shown in Figure 5. Figures 5a and 5b display the features extracted from the Office-Home Cl→Rw task. Figures 5c and 5d display the features extracted from the Office-Home Ar→Pr task.

Algorithm 3: SAF Training Algorithm
Module: feature map F, bottleneck B, classifier C,
        adversarial module D, SAF mixup module M
Parameter: learning rate λ
Input: source dataset (S, Y_S), target dataset (T, ∅)

BEGIN:
    # source label prediction objective
    P_S ← [C ∘ B ∘ F](S)
    ε_C ← L_C(P_S, Y_S)
    # adversarial domain adaptation objective
    ε_D ← L_D(F, B, D, C, S, T)
    # SAF-supervision objective
    F_T ← F(T)
    F̃_T, Ỹ_T ← M(F_T)
    ε_M ← L_M(F̃_T, Ỹ_T)
    # backpropagation (denoted as ←∗)
    F, M, B, C ←∗ −λ(ε_C + λ_D ε_D + λ_M ε_M)
    D ←∗ λ ε_D
END.

The visualization comparison between MDD and SAF shows that the SAF framework greatly improves feature clustering and semantic transferring between two domains.

Figure 5. t-SNE visualizations for features extracted from Office-Home Cl→Rw and Ar→Pr tasks: (a) MDD, Cl→Rw; (b) MDD+SAF, Cl→Rw; (c) MDD, Ar→Pr; (d) MDD+SAF, Ar→Pr. Best viewed in color. Red and blue dots represent source and target feature representations, respectively.
