
arXiv:2110.02117v1 [cs.CV] 5 Oct 2021

Self-Supervised Generative Style Transfer for One-Shot Medical Image Segmentation

Devavrat Tomar 1    Behzad Bozorgtabar 1,2    Manana Lortkipanidze    Guillaume Vray 1    Mohammad Saeed Rad 1    Jean-Philippe Thiran 1,2

1 LTS5, EPFL, Switzerland    2 CIBM, Switzerland
{devavrat.tomar, firstname.lastname}@epfl.ch

[Figure 1 diagram: Training phase — learn the flow and style distributions of dataset sites A and B; Generation phase — sample a site A (or site B) style and a site A flow and combine them with the single labeled atlas from site A to produce site-A-like image-segmentation pairs, optionally with the image style of site B; self-supervised style clustering; results compared against fully supervised segmentation.]

Figure 1: Cross-site adaptation of the proposed one-shot segmentation vs. fully supervised segmentation. Our method uses volumetric self-supervised learning for style transfer by leveraging unlabeled data. Zoom in for the best view.

Abstract

In medical image segmentation, supervised deep networks' success comes at the cost of requiring abundant labeled data. While asking domain experts to annotate only one or a few of the cohort's images is feasible, annotating all available images is impractical. This issue is further exacerbated when pre-trained deep networks are exposed to a new image dataset from an unfamiliar distribution. Using available open-source data for ad-hoc transfer learning or hand-tuned techniques for data augmentation only provides suboptimal solutions. Motivated by atlas-based segmentation, we propose a novel volumetric self-supervised learning method for data augmentation capable of synthesizing volumetric image-segmentation pairs by learning transformations from a single labeled atlas to the unlabeled data. Our work's central tenet benefits from a combined view of one-shot generative learning and the proposed self-supervised training strategy that clusters unlabeled volumetric images with similar styles together. Unlike previous methods, our method does not require input volumes at inference time to synthesize new images. Instead, it can generate diversified volumetric image-segmentation pairs from a prior distribution given a single- or multi-site dataset. Augmented data generated by our method and used to train the segmentation network provide significant improvements over state-of-the-art deep one-shot learning methods on the task of brain MRI segmentation. Ablation studies further exemplify that the proposed appearance model and joint training are crucial to synthesizing realistic examples compared to existing medical registration methods. The code, data, and models are available at https://github.com/devavratTomar/SST/.

1. Introduction

Automated medical image segmentation, for example, to localize anatomical structures, is of great importance for disease diagnosis or treatment planning. Fully supervised deep neural networks (DNNs) [31, 38] achieve state-of-the-art results when trained on large amounts of labeled data. However, acquiring abundant labeled data is often not feasible, as manual labeling is tedious and costly. Using available open-source data for domain adaptation [41, 5, 4] or hand-tuned approaches for augmentation only provides suboptimal solutions. Furthermore, the cross-modality adaptation methods [41, 42, 6] usually rely on fully labeled

1

arX

iv:2

110.

0211

7v1

[cs

.CV

] 5

Oct

202

1


source datasets. Medical images can vary significantly from institution to institution in terms of vendor and acquisition protocols [25]. As a result, pre-trained deep networks often fail to generalize to new test data that are not distributed identically to the training data. Furthermore, although hand-tuned data augmentation (DA) techniques [1, 32] have partially improved segmentation accuracy [37], manual refinements are not sustainable, and augmented images have limited capacity to cover the real variations and complex structural differences found in different images. Therefore, few-shot learning [35, 40] or self-supervised learning [10, 18] based approaches that alleviate the need for large labeled datasets are of crucial importance. However, these approaches have not been explored much for medical image segmentation in low labeled data regimes, e.g., one-shot scenarios. Another classical medical imaging approach often used to reduce the need for labeled data for synthesis and segmentation purposes is the atlas-based approach [7, 2, 26]. In this approach, an atlas is registered to each image and warped onto it (labels undergo the same transformation). One atlas (with its label) is enough for the procedure; however, one can utilize more if available to improve accuracy. Nevertheless, the heterogeneity of medical images causes inaccurate warping and, consequently, erroneous segmentation. This heterogeneity issue is even more pronounced for multi-site medical datasets. Recently, atlas-based approaches empowered by deep convolutional neural networks (CNNs) [52, 14, 51] have enabled the development of one-shot learning segmentation models.

Among recent one-shot learning methods, two are the most relevant to ours [52, 45]. In the first work, Zhao et al. [52] proposed a learning framework to synthesize labeled samples using a CNN to warp images based on an atlas. However, [52] only recreates the exact styles/deformations present in the dataset without inducing diversity. Furthermore, training includes two separate steps for spatial and appearance transformations, bringing extra computational overhead. In the second work [45], a one-shot segmentation method was proposed based on a forward-backward consistency strategy to stabilize the training process. Nonetheless, since the atlas' style does not match the unlabeled image style, this results in imprecise registration and imperfect segmentation. In summary, our main contributions are as follows:

• We propose a novel volumetric self-supervised contrastive learning approach to learn style representations from unlabeled data, facilitating registration and consequently the segmentation task in the presence of intra-site and inter-site heterogeneity of MR images (see Fig. 1);

• Unlike current state-of-the-art methods, our method does not require input volumes at inference time to synthesize new images. Instead, it can generate diversified and unlimited image-segmentation pairs by simply sampling from a prior distribution;

• Previous methods train the spatial and appearance models separately, resulting in sub-optimal performance compared to our joint optimization of all modules. Our method achieves state-of-the-art one-shot segmentation performance on two brain T1-weighted MRI datasets and improves the generalization ability in cross-site adaptation scenarios.

2. Related Work

Data Augmentation. Various data augmentation techniques have been developed to compensate for the extreme scarcity of labeled data encountered in medical image segmentation [9, 34, 52]. Augmentation approaches range from geometric transformations [31, 33] to data-driven augmentation methods [17, 27]. While the former methods are often difficult to tune, as they have limited capacity to cover the real variations and complex structural differences found in images, the latter approaches often learn generative models that synthesize images to be combined with real images for training the segmentation model. However, variations in shape and anatomical structures can negatively impact their performance, especially when little training data is available. In this regard, image registration has been effective in approximating deformations between unlabeled images so that augmented images with plausible deformations can be obtained [52, 50, 8]. The augmented images then allow training deep segmentation models with few labeled examples in a semi-supervised manner. Disappointingly, the heterogeneity of medical images often yields inaccurate warping between the moving image and the fixed image and, consequently, inaccurate segmentation. Olut et al. [34] proposed an augmentation method that leverages statistical deformation models learned from unlabeled data via principal component analysis. Shen et al. [39] proposed a geometry-based image augmentation method that generates realistic images via interpolation from the geodesic subspace to estimate the space of anatomical variability. He et al. [19] proposed a Deep Complementary Joint Model (DeepRS) for medical image registration and segmentation.

Image Segmentation in Low Data Regimes. Self-supervised learning and few-shot learning are two facets of the same problem: training a deep model in a low labeled data regime. These approaches have been used for sparsely annotated images for segmentation, but without much success. The former often relies on many training classes to avoid overfitting [44, 40], while the latter requires fine-tuning on sufficient labeled data before testing [16, 46]. Deep atlas-based models [3, 52, 45, 13, 2, 26] using a single atlas or multiple atlases have tackled weakly-supervised medical image segmentation. Balakrishnan et al. [3] developed



[Figure 2 diagram: the Style Encoder produces a style code for the target image; the Appearance Model applies it to the source image to obtain a target-styled source image; the Flow Model predicts a flow field that warps this image to the target image (predicted image). A query/key pair of Style Encoders with a queue of style keys provides the volumetric contrastive loss. A Flow AAE (Flow Encoder/Flow Decoder) encodes the flow fields produced for the base image by the trained Flow Model and Appearance Model into a flow code, and the discriminators Dstyle and Dflow enforce N(0, I) priors on the style and flow codes.]

Figure 2: Schematic description of the training phase. The Appearance Model applies the target image's style to the source image using the style code predicted by the Style Encoder, which is trained in parallel using the self-supervised volumetric contrastive loss. Then, the Flow Model non-linearly warps the style-translated source image to the target image. This allows backpropagation of supervision signals to all three models. Independently, the Flow AAE maps the flow fields corresponding to the base image, generated using the trained Flow Model and Appearance Model, to a normally distributed latent space. Dstyle and Dflow are trained in an adversarial manner to ensure that the style codes and the flow codes are normally distributed.

VoxelMorph, which aims to estimate pairwise 3D image registration through a learned CNN-based spatial transformation function. In a one-shot learning context, it learns to register a labeled atlas to any other unlabeled volume. This model suffers from variance in voxel intensity, which confuses the spatial transformer. In this regard, a similar unsupervised method [13] has been proposed that combines conventional probabilistic atlas-based segmentation with deep learning for MRI scan segmentation. More recently, a few deep models [52, 45] have explored the one-shot setting for medical image segmentation. Nevertheless, these methods either use samples from a single site (hospital) [45] or aggregate data from multiple sites [52] without cross-dataset transfer learning capability.

3. Method

We first recap the concept of our proposed one-shot atlas-based medical image segmentation. We formulate the synthesis of novel volumetric images and their corresponding segmentation labels as a learned random spatial and style deformation, sampled from a learned latent space, of the given single labeled volumetric atlas image (referred to as the base image). We employ a Style Encoder (Sec. 3.1) that learns to cluster similarly styled images together in a self-supervised manner using a volumetric contrastive loss, adapting Momentum Contrast [18], while imposing a normal distribution prior on the latent style codes using adversarial training [28]. The Appearance Model (Sec. 3.2) is trained to generate different styles of the base image without changing its spatial structure. For learning the spatial deformation correspondences (referred to as flow) between the base image and the target image, we employ a Flow Model (Sec. 3.3) that is trained on the task of registering two different image volumes with the same style. This is achieved by first changing the style of the moving image (referred to as the source image) to match the target image's style (as obtained by the Style Encoder) using the Appearance Model, followed by morphing it into the target image. As discussed later in Sec. 4.1, matching the source image's style to the target image improves the registration accuracy. We employ an additional adversarial autoencoder [28] that encodes the Flow Model's output for the base image into a Gaussian prior flow latent space to learn the distribution of the spatial deformation fields corresponding to the base image. At test time, we sample a flow latent code and a style latent code from the Gaussian prior to generate new deformation fields and style appearances for the base image, respectively. Fig. 2 and Fig. 3 show an overview of the training procedure and the data generation at test time, respectively. To quantify the quality of the images and their corresponding segmentation labels generated using our approach, we train a separate 3D U-Net [12] on them. We test this model on real image/segmentation pairs and compare it with the performance of a fully supervised model trained using real data. The loss terms used for training the Style Encoder, Appearance Model, and Flow Model are described in the subsequent subsections.



3.1. Style Encoder

The Style Encoder aims to learn a content-invariant, image-level representation that brings together similarly styled images and pushes apart dissimilarly styled images. To do so, we propose a new volumetric contrastive learning [10, 18, 11] based training strategy. In particular, we adapt Momentum Contrast [18] to volumetric medical images for the task of clustering images' styles, as opposed to the original classification task [18]. More importantly, we use a learned spatial transformer¹ to generate the positive keys (preserving the styles) instead of standard augmentations, e.g., the random cropping used in the original formulation. Without loss of generality, we keep a dictionary of keys {k1, k2, ..., kK} that represents different styles. During a training step, we sample a volumetric image q (called the query) from the training set and generate its corresponding positive-key volumetric image k+ by warping q to a randomly selected volumetric image from the training set using a pre-trained spatial transformer T. This ensures that q and k+ have the same style (with different structural geometries), which is different from the style keys in the dictionary. The dictionary's style keys are generated by a separate model (key Style Encoder) whose weights are updated as a moving average of the weights of the Style Encoder with momentum m = 0.99. The volumetric contrastive loss is computed as:

$\mathcal{L}_{\text{vol-cl}} = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_{i}/\tau)}$  (1)

where $k_{+} = T(q)$ and $\tau$ is a temperature hyper-parameter [48]. The sum is over one positive and $K$ negative samples.

¹ The spatial transformer is trained as in [3].

3.2. Appearance Model

The Appearance Model is responsible for translating the style of the source image ($s$) to that of the target image ($t$), given the style latent code predicted by the Style Encoder. We feed the target image's style code to the adaptive instance normalization (AdaIN) layers [20] of the model to perform style transfer. Thus, the target-styled source image ($\tilde{s}$) is obtained as:

$\tilde{s} = A(s, E_{\text{style}}(t))$  (2)

where $A$ denotes the Appearance Model and $E_{\text{style}}$ represents the Style Encoder. The Appearance Model loss $\mathcal{L}_{\text{app}}$ consists of two components: $\mathcal{L}_{\text{app}} = \mathcal{L}^{\text{style}}_{\text{cycle}} + \mathcal{L}^{\text{style}}_{\text{id}}$, where $\mathcal{L}^{\text{style}}_{\text{cycle}}$ and $\mathcal{L}^{\text{style}}_{\text{id}}$ denote the style consistency loss and style identity loss, respectively, described below.

Style Consistency Loss. We include a style consistency loss that guides the Appearance Model to generate images with the same spatial structure as the source image but with a different style in a consistent, cyclic manner. Given the style codes of the source and target images, the style consistency loss is computed as:

$\mathcal{L}^{\text{style}}_{\text{cycle}} = \mathcal{L}_{\text{SSIM-L1}}\big(s,\, A(\tilde{s}, E_{\text{style}}(s))\big)$  (3)

where $\mathcal{L}_{\text{SSIM-L1}}$ computes the multi-scale structural similarity index [47] and the L1 distance between the two images as:

$\mathcal{L}_{\text{SSIM-L1}}(u, v) = \|u - v\|_{1} + \big(1 - \text{SSIM}(u, v)\big)$  (4)
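A minimal sketch of Eq. (4) is given below; it assumes an external SSIM implementation (here the `ssim` function from the `pytorch_msssim` package, which supports 5D tensors) and intensities scaled to a known range. The paper uses the multi-scale variant [47], for which `ms_ssim` from the same package could be substituted.

```python
import torch
from pytorch_msssim import ssim  # assumed external SSIM implementation

def ssim_l1_loss(u, v, data_range=1.0):
    """L_SSIM-L1(u, v) = ||u - v||_1 + (1 - SSIM(u, v)), Eq. (4).

    u, v: image volumes of shape (B, C, D, H, W) with intensities in [0, data_range].
    """
    l1 = torch.mean(torch.abs(u - v))
    ssim_val = ssim(u, v, data_range=data_range, size_average=True)
    return l1 + (1.0 - ssim_val)
```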

Style Identity Loss. We also include a regularization loss term, called the style identity loss, that enforces the Appearance Model to reproduce the same image when given its own style:

$\mathcal{L}^{\text{style}}_{\text{id}} = \mathcal{L}_{\text{SSIM-L1}}\big(s,\, A(s, E_{\text{style}}(s))\big)$  (5)

3.3. Flow Model

The Flow Model builds upon a spatial transformer network that warps a moving image ($I_m$) to the fixed image ($I_f$). The spatial transformer ($F$) generates a correspondence map $\delta p$, referred to as flow, which is used to register $I_m$ onto $I_f$. This warping operation is defined as:

$\delta p = F(I_m, I_f), \quad y = \delta p \circ I_m$  (6)

where $\circ$ denotes the warping operator and $y$ is the predicted image. Once we know the correspondence map $\delta p$ between the base image ($b$) and the target image ($t$), we can transfer the segmentation label of the base image ($b_{\text{seg}}$) onto the target image using the same warping operation:

$t_{\text{seg}} = F(b, t) \circ b_{\text{seg}}$  (7)
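A sketch of the warping operator in Eqs. (6)-(7) using `torch.nn.functional.grid_sample` is shown below, under the common convention that the flow is a dense displacement field added to an identity sampling grid; the paper does not spell out this parameterization or the flow channel order, so both are assumptions.

```python
import torch
import torch.nn.functional as F

def warp(volume, flow):
    """Apply the warping operator of Eqs. (6)-(7): resample `volume` at voxel
    locations shifted by the displacement field `flow`.

    volume: (B, C, D, H, W) image (or one-hot segmentation, warped with the same flow).
    flow:   (B, 3, D, H, W) displacements in voxels; channel order (x, y, z) = (W, H, D) assumed.
    """
    B, _, D, H, W = volume.shape
    device = volume.device
    # Identity sampling grid in voxel coordinates, same channel order as `flow`.
    zs, ys, xs = torch.meshgrid(
        torch.arange(D, device=device),
        torch.arange(H, device=device),
        torch.arange(W, device=device),
        indexing="ij",
    )
    identity = torch.stack((xs, ys, zs), dim=0).float()            # (3, D, H, W)
    coords = identity.unsqueeze(0) + flow                          # (B, 3, D, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    sizes = torch.tensor([W, H, D], device=device).view(1, 3, 1, 1, 1).float()
    coords = 2.0 * coords / (sizes - 1.0) - 1.0
    grid = coords.permute(0, 2, 3, 4, 1)                           # (B, D, H, W, 3), last dim (x, y, z)
    return F.grid_sample(volume, grid, mode="bilinear", align_corners=True)
```

Segmentation labels can be warped with the same flow either as one-hot maps (as in Eq. (7)) or with `mode="nearest"`.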

To learn the distribution of deformation fields corresponding to the base image predicted by the Flow Model, we train a separate adversarial autoencoder [28] on its output. This allows us to encode the flow $\delta p$ in a latent space, which can later be used to generate novel volumetric images and corresponding segmentation labels (Sec. 3.6). The Flow Model loss $\mathcal{L}_{\text{flow}}$ consists of two components: $\mathcal{L}_{\text{flow}} = \mathcal{L}^{\text{flow}}_{\text{recon}} + \lambda_{\text{reg}} \mathcal{L}^{\text{flow}}_{\text{reg}}$, where $\mathcal{L}^{\text{flow}}_{\text{recon}}$ and $\mathcal{L}^{\text{flow}}_{\text{reg}}$ denote the reconstruction loss and the flow regularization loss, described below.

Reconstruction Loss. In contrast to the Normalized Cross-Correlation loss [3] commonly used for voxel registration, we include a pixel-similarity-based reconstruction loss between the target image ($t$) and the spatially warped target-styled source image (referred to as the predicted image)



obtained by the Flow Model. Using a pixel-wise similarity loss is justified because the Appearance Model changes the source image's style to match the style of the target image, thus allowing adequate registration of the two images:

$\mathcal{L}^{\text{flow}}_{\text{recon}} = \mathcal{L}_{\text{SSIM-L1}}\big(t,\, F(\tilde{s}, t) \circ \tilde{s}\big)$  (8)

Flow Regularization. We also regularize the flow $\delta p$ by penalizing its spatial gradient, thus ensuring the smoothness of the correspondence map generated by our spatial transformer:

$\mathcal{L}^{\text{flow}}_{\text{reg}} = \big\|\nabla F(\tilde{s}, t)\big\|_{1}$  (9)

We prefer the L1 norm as it helps to stabilize training and results in a less noisy flow compared to the L2 norm.
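A sketch of the L1 smoothness penalty in Eq. (9), where the gradient is discretized with forward finite differences along the three spatial axes of the predicted flow (the discretization is an assumption, not stated in the paper):

```python
import torch

def flow_smoothness_l1(flow):
    """||grad(flow)||_1 of Eq. (9) via forward finite differences.

    flow: (B, 3, D, H, W) displacement field predicted by the Flow Model.
    """
    dz = torch.abs(flow[:, :, 1:, :, :] - flow[:, :, :-1, :, :])
    dy = torch.abs(flow[:, :, :, 1:, :] - flow[:, :, :, :-1, :])
    dx = torch.abs(flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1])
    return dz.mean() + dy.mean() + dx.mean()
```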

3.4. Adversarial Loss

We introduce two latent discriminators, called $D_{\text{style}}$ and $D_{\text{flow}}$, to enforce a prior distribution on the latent style codes and the flow codes generated during training, respectively. We use the adversarial loss function described in LS-GAN [29] for training the two discriminators along with the Style Encoder and Flow Encoder in an adversarial manner, as shown in Fig. 2.

$\mathcal{L}^{\text{style}}_{\text{adv}} = \mathbb{E}_{t \sim X_{\text{data}}}\big[(D_{\text{style}}(E_{\text{style}}(t)) - 1)^{2}\big] + \mathbb{E}_{n \sim \mathcal{N}}\big[D_{\text{style}}(n)^{2}\big]$  (10)

where $t$ is sampled from the training images $X_{\text{data}}$, and $\mathcal{N}$ is the normal distribution. Similarly, we train the adversarial autoencoder (AAE) [28] on the flow fields generated by the Flow Model (Sec. 3.3) using the LS-GAN loss and an $\ell_1$ reconstruction loss. The optimization details of the AAE are included in the Appendix.
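For reference, Eq. (10) for the style code could be assembled as below (symmetric terms apply to the flow code). The split into an encoder term and a discriminator update is not written out in the paper, so the alternation in the comments is only one plausible LS-GAN-style reading.

```python
import torch

def style_adv_loss(D_style, style_code, prior_code):
    """Eq. (10): E_t[(D(E_style(t)) - 1)^2] + E_n[D(n)^2].

    style_code: E_style(t) for a batch of training volumes t, shape (B, d).
    prior_code: n ~ N(0, I), same shape as style_code (e.g., torch.randn_like(style_code)).
    """
    return ((D_style(style_code) - 1.0) ** 2).mean() + (D_style(prior_code) ** 2).mean()

# One plausible LS-GAN alternation (an assumption, not spelled out in the paper):
# the discriminator is trained on the same expression with the two targets swapped
# and the style codes detached, while the Style Encoder minimizes its own term of Eq. (10).
```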

3.5. Training Objective

Finally, the proposed training loss $\mathcal{L}_{\text{total}}$ joins the loss terms used to train the Style Encoder, Appearance Model, and Flow Model:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{vol-cl}} + \lambda_{1}\mathcal{L}_{\text{app}} + \lambda_{2}\mathcal{L}_{\text{flow}} + \lambda_{3}\mathcal{L}^{\text{style}}_{\text{adv}}$  (11)

where the $\lambda$'s are the weights of the different losses. We observed that pre-training the Style Encoder alone using the loss $\mathcal{L}_{\text{vol-cl}} + \lambda_{3}\mathcal{L}^{\text{style}}_{\text{adv}}$ improves the overall convergence rate and reduces the optimization's complexity. After pre-training the Style Encoder, we jointly optimize it along with the Appearance Model and Flow Model by minimizing $\mathcal{L}_{\text{total}}$. A sensitivity test on different values of the $\lambda$'s is included in the Appendix.

[Figure 3 diagram: samples $n_{\text{style}}, n_{\text{flow}} \sim \mathcal{N}(0, I)$ drive the Appearance Model (producing a random-styled base image) and the Flow Decoder (producing a flow field); the flow field warps the styled base image and the base segmentation to yield a generated image-segmentation pair.]

Figure 3: Data generation phase. We use the learned Appearance Model and Flow Decoder to generate new images and segmentation labels from the normal distribution.

3.6. Generation Phase

As shown in Fig. 3, we can sample new volumetric images and their segmentation maps by mapping Gaussian noise to novel style codes and flow fields using the trained Appearance Model ($A$) and Flow Decoder ($G_{\text{flow}}$). The Appearance Model performs style deformation of the base image. At the same time, the Flow Decoder produces a random flow field, which is then used to warp the styled base image and the base segmentation, thus generating new image-segmentation pairs:

$X = G_{\text{flow}}(n_{\text{flow}}) \circ A(b, n_{\text{style}}), \quad Y = G_{\text{flow}}(n_{\text{flow}}) \circ b_{\text{seg}}$  (12)

where $b$ and $b_{\text{seg}}$ are the base image and base segmentation, and $n_{\text{flow}}, n_{\text{style}} \sim \mathcal{N}$. $X$ and $Y$ are the generated image-segmentation pairs.
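A sketch of the sampling step in Eq. (12), reusing the `warp` helper sketched in Sec. 3.3; the module names (`appearance_model`, `flow_decoder`), the latent shapes, and the one-hot handling of the base segmentation are illustrative assumptions.

```python
import torch

@torch.no_grad()
def generate_pair(appearance_model, flow_decoder, base_image, base_seg_onehot,
                  style_dim=128, flow_latent_shape=(64, 4, 5, 5), device="cuda"):
    """Sample one synthetic image-segmentation pair as in Eq. (12).

    base_image:      (1, 1, D, H, W) atlas volume b.
    base_seg_onehot: (1, L, D, H, W) one-hot base segmentation b_seg (float).
    """
    n_style = torch.randn(1, style_dim, device=device)            # n_style ~ N(0, I)
    n_flow = torch.randn(1, *flow_latent_shape, device=device)    # n_flow ~ N(0, I); shape is an assumption

    styled_base = appearance_model(base_image, n_style)           # A(b, n_style)
    flow_field = flow_decoder(n_flow)                             # G_flow(n_flow), (1, 3, D, H, W)

    x = warp(styled_base, flow_field)                             # X = G_flow(n_flow) o A(b, n_style)
    y = warp(base_seg_onehot, flow_field)                         # Y = G_flow(n_flow) o b_seg
    return x, y.argmax(dim=1)                                     # label map recovered from the warped one-hot volume
```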

4. Experiments

This section introduces the implementation details, experimental settings, datasets, results, and ablation studies.

Dataset, Preprocessing and Evaluation Metric. We evaluate our method and other baselines on multi-study data from publicly available datasets: CANDI [23] and a large-scale dataset, OASIS [30], each containing 3D T1-weighted MRI volumes. The CANDI dataset consists of scans from 103 patients with manual anatomical segmentation labels, whereas the OASIS dataset has 2044 scans with segmentation labels obtained by the FreeSurfer [15] pipeline. As in VoxelMorph [3] and LT-Net [45], 28 anatomical structures are used in our experiments. All the dataset images are first preprocessed by removing the brain's skull region, followed by center cropping the volumes to 128 × 160 × 160. We set aside 20% of the brain images and their segmentations as a



Columns, left to right: Cerebral-WM, Cerebral-CX, Lateral-Vent, Cerebellum-WM, Cerebellum-CX, Thalamus-Proper, Caudate, Putamen, Pallidum, 3rd-Vent, 4th-Vent, Brain-Stem, Hippocampus, Amygdala, CSF, VentralDC, Mean±std.

CANDI
MABMIS           86.3 90.5 86.1 75.9 90.2 84.2 79.2 79.5 67.6 66.7 79.8 90.8 73.4 64.8 60.4 77.0   78.3±2.9
VoxelMorph       81.1 87.3 83.7 69.9 82.4 85.9 82.6 81.0 74.7 64.3 73.4 89.2 66.2 66.7 59.4 81.1   76.9±3.2
Bayesian         89.5 84.5 85.3 84.9 82.4 82.1 83.7 83.0 77.2 57.5 75.7 84.1 74.3 72.8 48.6 75.9   77.7±2.3
DataAug          84.8 89.0 77.4 72.3 86.8 89.0 84.6 86.3 79.8 71.2 78.7 91.3 76.0 72.3 63.3 82.7   80.4±3.2
LT-Net**         85.8 90.9 83.1 80.1 91.6 87.9 85.5 88.4 80.5 68.4 79.7 92.4 71.6 71.6 67.1 82.3   81.7±8.0
Ours             90.9 94.3 89.2 83.4 89.5 89.2 88.3 86.7 81.1 71.3 81.9 92.2 76.0 70.9 67.6 82.2   83.5±3.0
3D U-Net*        94.1 97.0 93.8 89.6 96.8 91.5 90.4 90.8 83.0 74.4 87.0 94.9 84.1 79.0 79.6 88.1   88.1±1.5

OASIS
MABMIS           79.4 62.1 84.4 77.1 82.6 85.3 77.5 78.9 66.5 81.5 62.3 88.4 72.5 72.9 63.6 78.3   75.8±6.9
VoxelMorph       75.2 63.1 85.3 77.9 83.4 85.0 79.1 81.3 72.3 81.0 58.8 90.5 70.3 72.8 67.6 79.4   76.4±4.3
Bayesian         90.0 45.8 64.7 82.4 72.9 84.1 70.1 70.7 40.6 33.4 18.8 87.8 48.3 56.9 33.0 62.1   59.5±3.0
DataAug          74.8 69.0 69.2 67.2 80.1 78.0 61.8 66.2 50.4 70.5 50.6 83.2 64.8 56.8 56.8 69.2   66.8±5.7
Ours             89.4 89.2 89.2 86.3 91.7 84.8 80.5 80.1 65.1 82.0 70.3 91.5 74.7 69.4 65.4 77.9   80.5±3.9
3D U-Net*        94.2 95.6 94.6 92.0 97.5 91.8 89.1 89.5 79.7 90.9 89.8 96.9 89.3 85.1 86.3 87.9   90.6±1.4

OASIS→CANDI
w/o Style Adap.  84.9 88.4 58.9 75.3 90.6 78.8 55.5 47.9 40.6 55.5 65.5 83.3 55.2 44.8 44.6 60.9   64.4±3.5
w/ Style Adap.   85.0 90.0 77.0 74.7 89.8 83.3 78.1 75.8 68.6 66.3 73.6 88.1 58.8 52.3 52.4 69.2   74.0±3.2

CANDI→OASIS
w/o Style Adap.  80.2 83.1 58.1 32.4 71.4 37.2 39.6 20.6  6.2  3.8  7.6 12.5 12.0 17.9  2.3 12.6   31.1±11.7
w/ Style Adap.   85.4 85.4 81.6 45.3 67.8 67.6 51.9 29.0  6.6 30.5 14.2 67.7 31.8 45.0 32.2 44.2   49.1±10.2

Table 1: Comparison of segmentation performance (mean Dice score in %) of MABMIS (2 atlases) [22], VoxelMorph [3], Bayesian [13], DataAug [52], LT-Net** [45], and Ours across various brain structures on the CANDI and OASIS datasets. Abbreviations used: white matter (WM), cortex (CX), ventricle (Vent), and cerebrospinal fluid (CSF). ** as reported in the published paper. * fully supervised model. The last four rows (OASIS→CANDI and CANDI→OASIS) show the results of our method with (w/) and without (w/o) cross-site style adaptation.

test set, which was untouched during training. The remaining 80% of the brain images are then used for training and validation, with a 90%-10% split, such that there is no patient ID overlap among the subsets. Each dataset has different acquisition details and health conditions. The image most similar to the anatomical average is selected as the only annotated atlas (base image) used for training. We only use validation-set labels for choosing the best model and hyper-parameters. For the OASIS dataset, the model is trained with a mixture of healthy subjects and diseased patients and is then evaluated on test cases drawn from both groups. We assess our method's efficiency by training a 3D U-Net-based segmentation model on the generated volumetric image-segmentation pairs and evaluating its performance on the untouched test data. We use the Dice similarity coefficient between the ground-truth segmentation and the predicted result to assess segmentation accuracy.
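For reference, a per-structure Dice score as used for evaluation could be computed as in the sketch below; the convention of integer label maps with a background label excluded from the reported structures is an assumption.

```python
import torch

def dice_per_label(pred, target, labels, eps=1e-6):
    """Dice similarity coefficient for each anatomical label.

    pred, target: integer label volumes (torch tensors) of identical shape.
    labels:       iterable of label ids to evaluate (e.g., the anatomical structures).
    """
    scores = {}
    for lab in labels:
        p = (pred == lab)
        t = (target == lab)
        inter = (p & t).sum().float()
        denom = p.sum().float() + t.sum().float()
        scores[lab] = ((2.0 * inter + eps) / (denom + eps)).item()
    return scores
```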

Experimental Settings. All our models are based on 3D CNNs [21]. The Appearance Model has 3 AdaIN [20] layers for performing style transfer using the style codes. The Flow Model is a lighter version of VoxelMorph [3], while the AAE model has a 3D convolutional encoder-decoder architecture. We implement all the models in PyTorch [36]. The architectural details of all the


Figure 4: Comparison of volume-wise segmentation accuracy (Dice score %) of our method with MABMIS (2 atlases) [22], VoxelMorph [3], Bayesian [13], and DataAug [52]. We outperform the second-best baseline by a margin of 3.1% on CANDI and 4.1% on the OASIS dataset (p-values of $5.8 \times 10^{-4}$ and $3.2 \times 10^{-16}$, respectively, using a paired t-test).

models are included in the Appendix. For training the Style Encoder, Appearance Model, Flow Model, and Flow Autoencoder, we used the Adam [24] optimizer with a learning rate of $2 \times 10^{-4}$ and $(\beta_1 = 0.9, \beta_2 = 0.999)$. We use the same optimizer with the same learning rate for training the latent style code and flow code discriminators, but with




Figure 5: Qualitative comparison of our method with other baselines on the segmentation task of brain MRI volumetric images from the CANDI and OASIS datasets. Left to right: (a) Ground truth, and the segmentation results from (b) MABMIS [22], (c) VoxelMorph [3], (d) Bayesian [13], (e) DataAug [52], (f) Fully Supervised 3D U-Net, (g) Ours. Best visualized in color.

$(\beta_1 = 0.5, \beta_2 = 0.999)$. A hyper-parameter search was conducted to find the optimal values. We choose the loss term weights as $\lambda_1 = 5.0$, $\lambda_2 = 1.0$, $\lambda_3 = 0.1$, and $\lambda_{\text{reg}} = 0.1$. The temperature coefficient $\tau$ for the volumetric contrastive loss is set to 0.7. We use a batch size of 32 for pre-training the Style Encoder and 4 when all the models are trained end-to-end. We utilize the same experimental setup, e.g., the same atlas, for all baseline experiments for a fair comparison.

Comparison with SOTA Methods. Our method surpasses the state-of-the-art methods in most semantic classes and, on average, on both the CANDI and OASIS datasets (see Fig. 4). A qualitative comparison of our method with several baselines on the segmentation of coronal brain slices is shown in Fig. 5. MABMIS [22] and VoxelMorph [3] produce visibly noisy and inaccurate boundaries compared to the other methods. Bayesian [13] and DataAug [52] struggle with the outer contours, where one crops them unnecessarily and the other includes background. Furthermore, VoxelMorph and DataAug encounter difficulties in identifying small segmentation regions, whereas Bayesian overproduces them. Instead, our method handles outer/inner regions and smaller anatomical regions well and compares closely to the supervised/ground-truth results. The qualitative observations are

Figure 6: Qualitative comparison of registration accuracy evaluated on segmentation for different methods. Left to right: Ground truth, VoxelMorph [3], MABMIS [22], Ours.

backed by quantitative metrics (see Table 1).

4.1. Ablations

Ablation Study on the Appearance Model. The Appearance Model plays a crucial role in generating diversely styled images and improving the Flow Model's image registration efficiency during training. Table 2 shows the effect of including or excluding the Appearance Model ($\mathcal{L}_{\text{app}}$) on the registration accuracy of the segmentation labels predicted by




Figure 7: Ablation of the segmentation accuracy of the 3D U-Net using different sizes of generated data on the CANDI and OASIS datasets. 100% corresponds to 1850 generated image-segmentation volumes.


Figure 8: Manipulating style and flow codes. Left to right: images generated using the same style code with different flow codes. Bottom to top: images generated using the same flow code and distinct styles.

the Flow Model and the supervised segmentation accuracy of the 3D U-Net trained on the generated images. For both scenarios, the average Dice score improves after the Appearance Model's inclusion. This is expected, as registering two images with similar intensity distributions is easier than registering images with different intensities. Without the Appearance Model, we can only generate similarly styled images in the generation phase, which leads to poor generalization of the 3D U-Net on the test set. A qualitative evaluation of registration on the CANDI test set is shown in Fig. 6.

Ablation Study on Joint Training. We observed that training our model end-to-end is critical for improved

Ablation Type           Method     CANDI (Mean±std)   OASIS (Mean±std)
w/o L_app               Reg. (a)   71.6±3.2           68.5±3.8
w/o Joint opt.          Reg. (a)   73.6±3.3           69.9±3.7
w/ Joint opt., L_app    Reg. (a)   78.9±2.4           73.3±3.1
w/o L_app               Sup. (b)   54.9±20.4          41.2±22.7
w/ L_app                Sup. (b)   83.5±3.0           80.5±3.9

Table 2: Effect of style transfer and joint training. Mean Dice score of the segmentation task with and without the Appearance Model and joint training, using (a) registration of the base image to the target image and (b) a 3D U-Net trained on generated image-segmentation pairs.

registration accuracy (Table 2). We experimented with pre-training the Appearance Model, Flow Model, and Style Encoder separately using the losses defined in Sec. 3, followed by fine-tuning, and found significant improvement when the Appearance Model and Flow Model are trained end-to-end. However, pre-training the Style Encoder using the volumetric contrastive loss improves the overall model convergence (10× faster) without compromising performance compared to training it jointly.

Cross-Site One-Shot Adaptation. To evaluate the Appearance Model's efficacy in generating the unlabeled target dataset's styles, we experimented with replacing the Appearance Model trained on CANDI (OASIS) with the one trained on OASIS (CANDI) to generate images in the style of the target domain. The generated volumetric image-segmentation pairs are then used to train a 3D U-Net, which is further evaluated on the target domain's test samples. As shown in Table 1, the domain gap between the two data sites severely impedes the generalization ability of the supervised model trained on the source data site (w/o style adaptation), while performing style adaptation between the OASIS and CANDI datasets provides substantial improvements. We use the same test samples from the CANDI and OASIS datasets for evaluation.

Exploring the Diversity of the Generated Data. We evaluate the 3D U-Net's performance on the segmentation task using different data sizes generated by our approach. For the best-performing 3D U-Net, we report the accuracy obtained by training it on 1850 generated image-segmentation pairs. Fig. 7 shows a box plot of the Dice score (%) of the 3D U-Net trained on different numbers of samples generated using our proposed method trained on CANDI and OASIS. This suggests that our method has learned to generate more diverse and realistic data. In addition, the effects of different flow codes and styles on the generated images can be observed in Fig. 8.



5. Conclusion and Future Work

We proposed a novel volumetric contrastive loss used for style transfer by leveraging unlabeled data for one-shot medical image segmentation. We presented a generic method adapted for cross-site one-shot segmentation scenarios to generate arbitrarily diversified volumetric image-segmentation pairs using a trained appearance model from one data site and a flow model from another data site. We demonstrated state-of-the-art one-shot segmentation performance on two T1-weighted brain MRI datasets under various settings and ablations. We shed light on our method's efficacy in closing the gap with a fully supervised segmentation model in the extreme case of only one labeled atlas. Our method uses neither tissue- nor modality-specific information and can be adapted to other modalities or anatomies. As future work, our method can easily be extended to the few-shot scenario using several atlases.

References

[1] Zeynettin Akkus, Alfiia Galimzianova, Assaf Hoogi, Daniel L Rubin, and Bradley J Erickson. Deep learning for brain MRI segmentation: state of the art and future directions. Journal of Digital Imaging, 30(4):449–459, 2017.
[2] Hossein Arabi, Nikolaos Koutsouvelis, Michel Rouzaud, Raymond Miralbell, and Habib Zaidi. Atlas-guided generation of pseudo-CT images for MRI-only and hybrid PET–MRI-guided radiotherapy treatment planning. Physics in Medicine & Biology, 61(17):6531, 2016.
[3] Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. VoxelMorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging, 38(8):1788–1800, 2019.
[4] Behzad Bozorgtabar, Mohammad Saeed Rad, Hazım Kemal Ekenel, and Jean-Philippe Thiran. Learn to synthesize and synthesize to learn. Computer Vision and Image Understanding, 185:1–11, 2019.
[5] Behzad Bozorgtabar, Mohammad Saeed Rad, Hazım Kemal Ekenel, and Jean-Philippe Thiran. Using photorealistic face synthesis and domain adaptation to improve facial expression analysis. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.
[6] Behzad Bozorgtabar, Mohammad Saeed Rad, Dwarikanath Mahapatra, and Jean-Philippe Thiran. SynDeMo: Synergistic deep feature alignment for joint learning of depth and ego-motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4219, 2019.
[7] Ciprian Catana, Andre van der Kouwe, Thomas Benner, Christian J Michel, Michael Hamm, Matthias Fenchel, Bruce Fischl, Bruce Rosen, Matthias Schmand, and A Gregory Sorensen. Toward implementing an MRI-based PET attenuation-correction method for neurologic studies on the MR-PET brain prototype. Journal of Nuclear Medicine, 51(9):1431–1438, 2010.
[8] Krishna Chaitanya, Neerav Karani, Christian F Baumgartner, Anton Becker, Olivio Donati, and Ender Konukoglu. Semi-supervised and task-driven data augmentation. In International Conference on Information Processing in Medical Imaging, pages 29–41. Springer, 2019.
[9] Chen Chen, Chen Qin, Huaqi Qiu, Cheng Ouyang, Shuo Wang, Liang Chen, Giacomo Tarroni, Wenjia Bai, and Daniel Rueckert. Realistic adversarial data augmentation for MR image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 667–677. Springer, 2020.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[11] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[12] Ozgun Cicek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424–432. Springer, 2016.
[13] Adrian V Dalca, Evan Yu, Polina Golland, Bruce Fischl, Mert R Sabuncu, and Juan Eugenio Iglesias. Unsupervised deep learning for Bayesian brain MRI segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 356–365. Springer, 2019.
[14] Mohamed S Elmahdy, Jelmer M Wolterink, Hessam Sokooti, Ivana Isgum, and Marius Staring. Adversarial optimization for joint registration and segmentation in prostate CT radiotherapy. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 366–374. Springer, 2019.
[15] Bruce Fischl. FreeSurfer. NeuroImage, 62(2):774–781, 2012.
[16] Jean-Bastien Grill, Florian Strub, Florent Altche, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 2020.
[17] Changhee Han, Leonardo Rundo, Ryosuke Araki, Yujiro Furukawa, Giancarlo Mauri, Hideki Nakayama, and Hideaki Hayashi. Infinite brain MR images: PGGAN-based data augmentation for tumor detection. In Neural Approaches to Dynamics of Signal Exchanges, pages 291–303. Springer, 2020.
[18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[19] Yuting He, Tiantian Li, Guanyu Yang, Youyong Kong, Yang Chen, Huazhong Shu, Jean-Louis Coatrieux, Jean-Louis Dillenseger, and Shuo Li. Deep complementary joint model for complex scene registration and few-shot segmentation on medical images. arXiv preprint arXiv:2008.00710, 2020.
[20] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[21] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2012.
[22] Hongjun Jia, Pew-Thian Yap, and Dinggang Shen. Iterative multi-atlas-based multi-image segmentation with tree-based registration. NeuroImage, 59(1):422–430, 2012.
[23] David N Kennedy, Christian Haselgrove, Steven M Hodge, Pallavi S Rane, Nikos Makris, and Jean A Frazier. CANDIShare: a resource for pediatric neuroimaging data, 2012.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] Kelvin K Leung, Matthew J Clarkson, Jonathan W Bartlett, Shona Clegg, Clifford R Jack Jr, Michael W Weiner, Nick C Fox, Sebastien Ourselin, Alzheimer's Disease Neuroimaging Initiative, et al. Robust atrophy rate measurement in Alzheimer's disease using multi-site serial MRI: tissue-specific intensity normalization and parameter selection. NeuroImage, 50(2):516–523, 2010.
[26] Maria Lorenzo-Valdes, Gerardo I Sanchez-Ortiz, Raad Mohiaddin, and Daniel Rueckert. Atlas-based segmentation and tracking of 3D cardiac MR images using non-rigid registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 642–650. Springer, 2002.
[27] Dwarikanath Mahapatra, Behzad Bozorgtabar, and Ling Shao. Pathological retinal region segmentation from OCT images using geometric relation based augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9611–9620, 2020.
[28] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[29] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
[30] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John C Morris, and Randy L Buckner. Open Access Series of Imaging Studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. Journal of Cognitive Neuroscience, 19(9):1498–1507, 2007.
[31] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
[32] Pim Moeskops, Max A Viergever, Adrienne M Mendrik, Linda S De Vries, Manon JNL Benders, and Ivana Isgum. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1252–1261, 2016.
[33] Americo Oliveira, Sergio Pereira, and Carlos A Silva. Augmenting data when training a CNN for retinal vessel segmentation: How to warp? In 2017 IEEE 5th Portuguese Meeting on Bioengineering (ENBENG), pages 1–4. IEEE, 2017.
[34] Sahin Olut, Zhengyang Shen, Zhenlin Xu, Samuel Gerber, and Marc Niethammer. Adversarial data augmentation via deformation statistics. In European Conference on Computer Vision, pages 643–659. Springer, 2020.
[35] Cheng Ouyang, Carlo Biffi, Chen Chen, Turkay Kart, Huaqi Qiu, and Daniel Rueckert. Self-supervision with superpixels: Training few-shot medical image segmentation without annotation. In European Conference on Computer Vision, pages 762–780. Springer, 2020.
[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[37] Sergio Pereira, Adriano Pinto, Victor Alves, and Carlos A Silva. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Transactions on Medical Imaging, 35(5):1240–1251, 2016.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[39] Zhengyang Shen, Zhenlin Xu, Sahin Olut, and Marc Niethammer. Anatomical data augmentation via fluid-based image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 318–328. Springer, 2020.
[40] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[41] Devavrat Tomar, Manana Lortkipanidze, Guillaume Vray, Behzad Bozorgtabar, and Jean-Philippe Thiran. Self-attentive spatial adaptive normalization for cross-modality domain adaptation. IEEE Transactions on Medical Imaging, 2021.
[42] Devavrat Tomar, Lin Zhang, Tiziano Portenier, and Orcun Goksel. Content-preserving unpaired translation from simulated to realistic ultrasound images. arXiv preprint arXiv:2103.05745, 2021.
[43] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[44] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 9197–9206, 2019.
[45] Shuxin Wang, Shilei Cao, Dong Wei, Renzhen Wang, Kai Ma, Liansheng Wang, Deyu Meng, and Yefeng Zheng. LT-Net: Label transfer by learning reversible voxel-wise correspondence for one-shot medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9162–9171, 2020.
[46] Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12275–12284, 2020.
[47] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
[48] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[49] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[50] Zhenlin Xu and Marc Niethammer. DeepAtlas: Joint semi-supervised learning of image registration and segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 420–429. Springer, 2019.
[51] Heran Yang, Jian Sun, Huibin Li, Lisheng Wang, and Zongben Xu. Neural multi-atlas label fusion: Application to cardiac MR images. Medical Image Analysis, 49:60–75, 2018.
[52] Amy Zhao, Guha Balakrishnan, Fredo Durand, John V Guttag, and Adrian V Dalca. Data augmentation using learned transformations for one-shot medical image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8543–8553, 2019.

6. Appendix

6.1. Sensitivity Test

We conduct a sensitivity test on our training objective's hyperparameters (the $\lambda$'s) for the CANDI and OASIS datasets in Table 3. To be consistent, we chose the same $\lambda$ values, optimized for the OASIS dataset, for the final quantitative evaluation.

Dataset  Method     λ1=1.0      λ1=10.0     λ2=2.0      λ2=5.0      λ3=0.2      λ3=0.5
OASIS    Reg. (a)   69.7±5.0    70.9±5.6    71.6±4.9    70.9±5.1    72.5±4.9    71.9±5.0
OASIS    Sup. (b)   76.0±4.1    77.5±5.3    78.6±4.3    76.1±5.3    79.8±4.6    77.6±4.7
CANDI    Reg. (a)   77.7±2.9    77.8±3.0    77.6±3.1    77.5±3.2    78.9±2.5    78.6±2.6
CANDI    Sup. (b)   84.4±2.0    85.7±2.6    86.1±2.0    85.5±2.1    86.0±2.1    84.8±2.0

Table 3: Sensitivity test for different values of the λ's in the proposed training loss (mean Dice ± std). We change one hyperparameter at a time while the other hyperparameters are kept the same. Method Reg. (a) denotes the registration accuracy between the base image and the target image; Method Sup. (b) denotes the accuracy obtained by a 3D U-Net trained on generated image-segmentation pairs.

6.2. The Architectural Details

Throughout this appendix, we use the following notation in the architecture figures:

• Conv3D denotes a 3D convolution layer.

• ConvT3D denotes a 3D transpose convolution layer.

• ks denotes the size of the kernel.

• stride denotes the shift of the convolutional kernel in pixels while performing convolution.

• nf denotes the number of output features.

• Linear denotes a fully connected multi-layer perceptron.

• Upsample denotes the upsampling of the features in the spatial dimensions.

• Concat denotes the concatenation of feature maps of two layers.

6.2.1 Style Encoder’s Architecture

The Style Encoder consists of 3D convolutional layers along with 3D Instance Normalization [43] and Parametric ReLU [49] (Leaky ReLU) as the nonlinear activation with parameter 0.2. The complete architecture is given in Fig. 9.

[Figure 9 diagram: Input image → Conv3D (ks 7×7×7, stride 4, nf 16) → InstanceNorm3D → Leaky ReLU → Conv3D (3×3×3, stride 2, nf 32) → InstanceNorm3D → Leaky ReLU → Conv3D (3×3×3, stride 2, nf 64) → InstanceNorm3D → Leaky ReLU → Conv3D (3×3×3, stride 2, nf 128) → InstanceNorm3D → Leaky ReLU → Conv3D (1×1×1, stride 1, nf 64) → Flatten → Linear (nf 128) → Leaky ReLU → Linear (nf 128) → Leaky ReLU → Linear (nf 128) → style code.]

Figure 9: The Architecture of the Style Encoder.

6.2.2 Appearance Model’s Architecture.

The Appearance Model comprises 3D convolutions, Leaky ReLU activations with parameter 0.2, and AdaIN [20] layers. The architecture of the Appearance Model is given in Fig. 10.

6.2.3 Flow Model’s Architecture.

The Flow Model is a lighter version of 3D VoxelMorph [3] whose architecture is shown in Fig. 11:



[Figure 10 diagram: the Appearance Model is built from an initial Conv3D (7×7×7, stride 1, nf 16) with Leaky ReLU, down-sampling Conv3D (5×5×5, stride 2, nf 32) layers with InstanceNorm3D and Leaky ReLU, ResnetBlocks (each two Conv3D (3×3×3, stride 1, nf 32) + Leaky ReLU layers), up-sampling stages (Upsample ×2 followed by ConvAdaIN blocks with nf 32 and nf 16), and a final Conv3D (3×3×3, stride 1, nf 1) with Tanh. The style code is mapped through two Linear (nf 128) + Leaky ReLU layers and fed to the AdaIN layers; a ConvAdaIN block is Conv3D (3×3×3, stride 1) → AdaIN → Leaky ReLU.]

Figure 10: The Architecture of the Appearance Model. The Appearance Model receives an image and a style code and performs style manipulation of the input image based on the style code. The ResnetBlock and ConvAdaIN blocks are shown separately.

[Figure 11 diagram: the Flow Model takes the Moving Image and Fixed Image and passes them through an encoder of Conv3D (5×5×5) + Leaky ReLU layers (strides 2, 2, 2, 2, 1, 1 with nf 16, 32, 32, 32, 32, 32) and a decoder of Upsample ×2 + Conv3D (5×5×5, stride 1) + Leaky ReLU layers with skip concatenations (nf 32, 32, 16, 16), ending with a Conv3D (5×5×5, stride 1, nf 3) that outputs the flow field used to warp the Moving Image.]

Figure 11: The Architecture of the Flow Model. We feed the Moving Image and Fixed Image to our Flow Model to obtain the corresponding flow field, which is then used to warp the Moving Image into the Fixed Image.

6.2.4 The Architecture of the Flow Adversarial Auto-Encoder

The architecture of the Flow Adversarial Autoencoder is shown in Fig. 12.

6.2.5 Latent Discriminators

For the latent discriminators, we use fully connected multi-layer perceptrons. The architectures of the Latent Style Code Discriminator Dstyle and the Latent Flow Code Discriminator Dflow are shown in Fig. 13.

6.2.6 3D U-Net

The architecture of the 3D U-Net is shown in Fig. 14.



[Figure 12 diagram: the Flow Encoder maps the flow field through Conv3D (5×5×5, stride 2) layers with InstanceNorm3D and Leaky ReLU (nf 32, 64, 128, 256) and a Conv3D (4×4×4, stride 2, nf 64) to the flow code; the Flow Generator mirrors it with ConvT3D (4×4×4, stride 2) layers with InstanceNorm3D and Leaky ReLU (nf 256, 128, 64, 32, 32) and a final Conv3D (3×3×3, stride 1, nf 3) that reconstructs the flow field.]

Figure 12: The Architecture of the Flow AAE. The Flow Encoder encodes the given flow field into a latent flow code, while the Flow Generator reconstructs the same flow field from the corresponding latent flow code. We regularize the flow code using an adversarial loss.

[Figure 13 diagram: Dstyle maps the style code through five Linear (nf 128) + Leaky ReLU layers and a final Linear (nf 1) to a real/fake score; Dflow maps the flow code through two Linear (nf 256) + Leaky ReLU layers and a final Linear (nf 1) to a real/fake score.]

Figure 13: The Architectures of the Latent Style and Latent Flow Discriminators.

[Figure 14 diagram: the 3D U-Net has an encoder of DoubleConv3D blocks (nf 32, 64, 128, 256, 512) with strided Conv3D (4×4×4, stride 2) down-sampling, and a decoder of Upsample ×2 + Conv3D (1×1×1, stride 1) layers with skip concatenations followed by DoubleConv3D blocks (nf 256, 128, 64, 32), ending with a Conv3D (1×1×1, stride 1) output layer; a DoubleConv3D block is two Conv3D (3×3×3, stride 1) + Leaky ReLU layers.]

Figure 14: The architecture of the 3D U-Net.



6.3. Optimization Details for the Flow AAE

The Flow Adversarial Autoencoder (Flow AAE) corresponding to the base image is trained by optimizing the following objective:

$\min_{E_{\text{flow}}, G_{\text{flow}}} \max_{D_{\text{flow}}} \; \mathbb{E}_{f \sim X_{\text{flow}}}\Big[\big\|f - G_{\text{flow}}(E_{\text{flow}}(f))\big\|_{1} + \mu\big(D_{\text{flow}}(E_{\text{flow}}(f)) - 1\big)^{2}\Big] + \mathbb{E}_{n \sim \mathcal{N}}\Big[\mu\, D_{\text{flow}}(n)^{2}\Big]$

where $X_{\text{flow}}$ is the distribution of the flow fields generated by the Flow Model for the base image, $G_{\text{flow}}$ and $E_{\text{flow}}$ represent the Flow Decoder and Flow Encoder of the Flow AAE model, $D_{\text{flow}}$ is the latent flow code discriminator, $\mathcal{N}$ is the normal distribution, and $\mu$ is a trade-off weight; we set $\mu = 0.1$.
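A sketch that only assembles the three terms of the objective above; how they are split between the encoder/decoder update and the discriminator update in the alternating min-max schedule is left to the training loop, since the paper gives only the combined expression.

```python
import torch

MU = 0.1  # trade-off weight from Sec. 6.3

def flow_aae_objective(E_flow, G_flow, D_flow, flow_field):
    """Return the three terms of the Flow AAE objective separately.

    flow_field: (B, 3, D, H, W) flow sampled from X_flow (Flow Model outputs for the base image).
    """
    code = E_flow(flow_field)
    recon = torch.mean(torch.abs(flow_field - G_flow(code)))      # ||f - G_flow(E_flow(f))||_1
    n = torch.randn_like(code)                                     # n ~ N(0, I)
    adv_code = MU * ((D_flow(code) - 1.0) ** 2).mean()             # mu * (D_flow(E_flow(f)) - 1)^2
    adv_prior = MU * (D_flow(n) ** 2).mean()                       # mu * D_flow(n)^2
    return recon, adv_code, adv_prior
```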

6.4. Additional Qualitative Results

6.4.1 Linear interpolation in the Flow Latent Space

Fig. 15 shows the efficacy of our proposed method in obtaining a linear flow latent space. We observe a smooth transition of the image-segmentation pairs generated using a convex combination of two different latent flow codes.

6.4.2 The t-SNE Projection of the Style Codes

Fig. 16 shows a 2D projection of the style latent codes and corresponding images obtained by our method using the self-supervised volumetric contrastive loss. We observe that images with a similar style are clustered together.

6.4.3 Flow Fields Examples

Fig. 17 shows sample results of the deformation field applied on grid images.



Figure 15: A linear walk in the flow latent space. Images and segmentation labels are generated by linearly interpolating between two different flow latent codes. Left to right: linear interpolation of the flow latent codes from the first column to the last column using the same style.



Figure 16: The t-SNE 2D projection of the style codes. Self-supervised clustering of similarly styled images is shown.


Figure 17: Deformation fields applied on grid images. Left: norm of the vector flow fields. Right: flow field applied on a grid image.
