
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2019.2905990, IEEE Transactions on Medical Imaging


TETRIS: Template Transformer Networks for Image Segmentation with Shape Priors

Matthew Chung Hai Lee1,2, Kersten Petersen1, Nick Pawlowski2, Ben Glocker1,2,*, and Michiel Schaap1,2,*

1 HeartFlow, California, CA 94063, USA
2 Biomedical Image Analysis Group, Imperial College London, London, UK

* Shared senior authorship.
mlee@heartflow.com

Abstract—In this paper we introduce and compare different approaches for incorporating shape prior information into neural network based image segmentation. Specifically, we introduce the concept of template transformer networks, where a shape template is deformed to match the underlying structure of interest through an end-to-end trained spatial transformer network. This has the advantage of explicitly enforcing shape priors, and is free of discretisation artefacts by providing a soft partial volume segmentation. We also introduce a simple yet effective way of incorporating priors in state-of-the-art pixel-wise binary classification methods such as fully convolutional networks and U-Net. Here, the template shape is given as an additional input channel; incorporating this information significantly reduces false positives. We report results on synthetic data and sub-voxel segmentation of coronary lumen structures in cardiac computed tomography, showing the benefit of incorporating priors in neural network based image segmentation.

Index Terms—Image Segmentation, Shape Priors, Neural Networks, Template Deformation, Image Registration

I. INTRODUCTION

SEGMENTATION of anatomical structures can be greatly improved by incorporating priors on shape, assuming population-wide regularities are observed, or that expert knowledge is available. Shape priors help to reduce the search space of potential solutions for machine learning algorithms, improving the accuracy and plausibility of solutions [1]. Priors are particularly useful when data is ambiguous, corrupt, exhibits low signal-to-noise, or if training data is scarce.

Some of the first attempts to explicitly enforce shape priors in segmentation pipelines made use of deformable templates [2], combining image registration with a shape template to perform segmentation. Subsequently, this method was combined with anatomical atlases to perform segmentation of different organs [3]–[5]. However, these methods require either an image-to-image or image-to-segmentation likelihood function to drive atlas matching or alignment of the deformation model. Statistical methods such as active shape models [6] have been explored extensively; the difficulty lies in constructing the shape models in the first place, which are then often limited in their expressiveness by the underlying manifold learning method (linear or non-linear principal component analysis).

State-of-the-art neural network based segmentation models [7]–[10] typically optimize pixel-wise loss functions such as mean squared error or cross entropy, and more recently differentiable Dice [11]. These objective functions do not take explicit priors into consideration during training. Nevertheless, smoothness priors can be enforced at test time using conditional random fields or similar post-processing techniques. More recent work has shown improved results by directly incorporating shape constraints into the learning algorithm rather than applying them as post-processing, such as in [12]–[14], where priors are learnt to regularize neural network embeddings during training. While this can lead to networks that favour plausible segmentations, there is no guarantee that the outputs adhere to desired shape constraints, such as a single connected component or a closed surface.

Contributions

In this paper we introduce a new neural network model based on template deformations which utilises spatial transformer networks [15]. Our model leverages the representational power of neural networks while explicitly enforcing shape constraints in segmentations by restricting the model to perform segmentation through deformations of a given shape prior. We call this Template Transformer Networks for Image Segmentation (TETRIS). As with template deformations, our method produces anatomically plausible results by regularizing the deformation field. This also avoids discretisation artefacts, as we do not restrict the network to make pixel-wise classifications. By using a neural network that is trained to align the shape prior to the structure of interest visible in the input image, there is no need for a hand-crafted (intensity-based) image-to-segmentation registration measure as with other template deformation models. To the best of our knowledge, this is the first full 3D neural network-based segmentation-through-registration method combining deep convolutional neural nets with spatial transformers.

Another contribution of this paper is the demonstration that state-of-the-art segmentation algorithms can be easily extended to incorporate implicit shape priors by providing a shape template as additional input during training. To the best of our knowledge, this simple yet effective enhancement has not been considered in the past. Our benchmarking shows that this can lead to a significant increase in segmentation accuracy. A high-level graphical overview of these methods is given in Figure 1.



Fig. 1. Schematic to illustrate the differences between traditional, pixel-wise segmentation models, the naive way of incorporating priors through additional input, and our TETRIS model, which produces a set of parameters for a transformation. By restricting the output space of the network to only deformations of the prior, we obtain guarantees on topology.

We present promising results on coronary artery segmentation from cardiac computed tomography, which further strengthen the case for the use of priors in medical image segmentation with deep neural nets. Experimentally, we show that all methods which utilise prior information are able to consistently improve the cross entropy score of the segmentation, and that our method is able to retain a singly connected component segmentation. Our quantitative results show the varying strengths and weaknesses of the two introduced methods. We also present qualitative results on synthetic examples to demonstrate the effects of out-of-sample data and how this affects neural network segmentations.

Related Work

Our proposed model lies at the intersection of machine learning based image registration and segmentation, and the incorporation of shape priors into neural networks. The following discusses the most related work, but due to space limitations and the large amount of work in these fields, this cannot provide a comprehensive overview.

Atlas and Registration Based Segmentation: Atlas based segmentation algorithms [16] are among the most popular methods and rely on two key components: an intensity-based image-to-image matching term Lη and a set of training examples, i.e. the atlases, with corresponding labels. During testing, images can be compared to examples in the atlas dataset using Lη, and the label masks of the most similar atlases are selected as candidate segmentations. This concept can be extended to employ patch based techniques or advanced label fusion procedures, and is robust when label boundaries occur in homogeneous regions. However, this method often offers a coarse segmentation, which may lack precision and can be refined using linear and nonlinear registration. This refinement can be done in either image or segmentation space [17]. The authors of [18] proposed a combination of an image-to-image and a segmentation-to-segmentation likelihood function, using a Lagrange multiplier to weight the contribution of each term. If an image-to-segmentation likelihood function is used, then this approach is better referred to as template registration [19].

Statistical Shape Models: Active shape models as introduced in [20] explicitly model shape based on training examples. By discretizing the k-dimensional shapes using n control points, they create point distribution models using an ellipsoid prior. Principal modes of variation can then be found using principal component analysis (PCA) [21]. New shapes can be represented as a linear weighting of these components. Additionally, by restricting the model to use t principal components, where t < kn, and restricting the range of values each linear weighting can take, a valid shape space is produced. Active appearance models [22] build on this technique and jointly model appearance together with shape. These models have been used widely in the medical imaging community to perform segmentation [6]. However, such models are heavily biased by the distribution of the training set used to build them.

Network Based Image Registration: Traditional registration algorithms take two images, a moving M and a fixed F, and perform registration by iteratively updating some parametrized transformation Tθ which maps image grid locations to each other, such that some loss function Lη(M ◦ Tθ, F) is minimized, where bespoke parameters θ are found for a given pair of images at test time. Optimization of the algorithm can be considered as optimizing η, some parametrization of the loss function which results in the 'best' registrations (such as the values of Lagrange multipliers), or as optimizing the choice of T (the transformation family expressible).

The key difference between neural network based image registration and traditional, iterative registration algorithms is that the loss function is only computed during training for neural networks. The parameters of the neural network implicitly encode what transformation, conditioned on the input, is needed to register the image with minimal cost, instead of repeatedly calculating a loss to iteratively update the parameters θ.
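To make this distinction concrete, the following toy sketch (our own illustration, not code from any cited work) contrasts per-pair iterative optimisation with a single forward pass through a trained network. Here warp is a hypothetical one-dimensional stand-in for M ◦ Tθ, and f_psi stands in for a trained registration network.

```python
import torch

def warp(M, t):
    """Differentiably shift a 1-D signal M by displacement t (toy M ∘ T_theta)."""
    n = M.shape[0]
    x = torch.arange(n, dtype=torch.float32) - t        # source coordinates
    x0 = x.floor().clamp(0, n - 2)
    w = (x - x0).clamp(0, 1)                            # linear interpolation weight
    return (1 - w) * M[x0.long()] + w * M[x0.long() + 1]

def register_iterative(M, F, steps=200, lr=0.5):
    """Traditional registration: theta is optimised per image pair at test time."""
    theta = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((warp(M, theta) - F) ** 2).mean().backward()   # loss evaluated every step
        opt.step()
    return theta.detach()

def register_amortised(M, F, f_psi):
    """Network-based registration: one forward pass at test time; the loss is
    only ever computed while training f_psi."""
    return f_psi(torch.stack([M, F]))
```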

Recent works on neural network based image registration fall into two major categories. The first treats network based registration as a regression problem on a given ground truth deformation field, such as in [23]–[27]. These methods, unlike ours, can be used as fast approximations to other registration models. The second group of methods learns the deformation field implicitly while optimising a downstream task. For example, [28] combine momentum-parametrization for LDDMM shooting [29] and neural networks to learn an end-to-end model for registration. Reinforcement learning approaches have also been used to perform image registration [30]. They treat the registration problem as an iterative update of four translation and two rotation parameters, so do not handle free-form deformations.



Fig. 2. TETRIS takes as input an image and a shape prior in the form of a partial volume image and produces a set of parameters for a transformation; this transformation is then applied to the prior, and the loss is calculated on the deformed prior and the target segmentation during training.

Related network-based registration methods [31] and [32] use 2D spatial transformer networks to embed the deformation model into a neural network pipeline in order to learn the registration model end-to-end, the latter of which uses a FlowNet architecture [24]. This work was built on in [33], which performs fully unsupervised 3D registration. However, unlike our model, all of these methods perform image-to-image registration rather than template deformation for a downstream segmentation task. The authors of [34], [35] begin to investigate template deformations but do not take this to its full 3D potential.

Shape Priors in Neural Networks: Finally, we discuss methods which incorporate shape priors into neural networks. Though conditional random fields are considered smoothness priors, they do assist in providing shape consistency in segmentations. CRFs are incorporated into the training process in [12] by casting CRFs as recurrent neural networks. This allows the segmentation and refinement model to be trained end-to-end. Adversarial training was used in [36] as a means of learning such regularisation without the need for an explicit model whilst still being able to train end-to-end. A discriminator network was used to distinguish segmentations from a network and ground truth segmentations; this training process encourages the network segmentations to look more plausible. An interleaving process was proposed in [37], where iterative training of a neural network and CRF refinement was performed, inspired by the grab cut [38] method, though this model was not trained end-to-end. More recent work has shown improved results by directly incorporating principal component based shape constraints into the learning algorithm. Building on the work of active shape models [20], the authors of [13] use a PCA layer embedded in the neural network to restrict its output space to be weightings of the principal components; this was extended to a probabilistic model in [39]. Another approach proposed in [14] exploits the fact that autoencoders are able to capture a low dimensional representation of the shapes of segmentation maps. This encoding is then used at training time to constrain the outputs of a segmentation network to be close to this low dimensional manifold via adversarial training. The latter two methods utilise anatomical consistency across subjects.

Spatial Transformer Networks: Our work builds heavily on spatial transformer networks (STNs) [15], which we describe below. STNs are a neural network model that, conditioned on some input I, returns θ for some parametrized transformation model Tθ(G). That is, θ = fψ(I), where fψ is a neural network, itself parametrized by ψ. Once we have θ, we are able to differentiably re-sample our image I to V using Tθ; as with image registration, here the image itself is re-sampled. The STN model then passes the re-sampled image V to another neural network, gξ(V), which performs some downstream task. In [15], they utilise this powerful model to train gξ, which performs a classification task, and simultaneously train fψ, a deformation model that makes the downstream task easier via a combination of rescaling, region of interest extraction and rotation of the input images.

During training, the loss is calculated on the downstream task only; for classification this could be the cross entropy loss between the predicted class produced by gξ(V) and the true class. Since the neural network gξ is a differentiable function and sampling from I to V is also differentiable, we are able to train both tasks end-to-end. Inherently, a spatial transformer performs deformations that assist the downstream task, as opposed to having a loss calculated directly on the task of deforming. This can be considered an implicit registration step, where the registration is autonomously discovered by the network for optimal downstream performance. We do not have to decide what kind of deformation will be good for the task, though we do need to specify the family of deformations T. At test time, unlike iterative registration models, no loss value needs to be calculated. We simply need to perform a forward pass through the network to get both the deformation and the class prediction.
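A minimal PyTorch sketch of this pipeline may help; the layer sizes (single-channel 28×28 inputs) are our own illustration rather than the architecture of [15]. The localisation network f_psi predicts an affine θ, the input is differentiably re-sampled, and the classifier g_xi produces the output on which the loss is computed.

```python
import torch
import torch.nn.functional as F

class STN(torch.nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.f_psi = torch.nn.Sequential(                  # localisation network
            torch.nn.Flatten(), torch.nn.Linear(28 * 28, 32),
            torch.nn.ReLU(), torch.nn.Linear(32, 6))
        self.f_psi[-1].weight.data.zero_()                 # start at the identity
        self.f_psi[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
        self.g_xi = torch.nn.Sequential(                   # downstream classifier
            torch.nn.Flatten(), torch.nn.Linear(28 * 28, n_classes))

    def forward(self, I):
        theta = self.f_psi(I).view(-1, 2, 3)               # theta = f_psi(I)
        grid = F.affine_grid(theta, I.size(), align_corners=False)
        V = F.grid_sample(I, grid, align_corners=False)    # differentiable sampling
        return self.g_xi(V)                                # loss applied here only
```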

II. TEMPLATE TRANSFORMER NETWORKS

Traditional template deformation models require the definition of an image-to-segmentation matching function as an approximation or surrogate to the actual segmentation objective. Iterative optimisation is then used to incrementally update the transformation parameters in order to maximize agreement between a template and the image to be segmented. In contrast, our method makes use of network based registration, which only requires the computation of a corresponding loss function (equivalent to the matching function) at training time.


[Figure 3: three network diagrams built from the blocks of Figure 12 — a residual U-Net (convolutions, residual units, tri-linear upsampling, concatenation, sigmoid output), a residual FCN (convolutions, residual units, residual matching units, sigmoid output), and the template transformer network (bottleneck residual units, max pooling, batch normalization, leaky ReLU activations, and a final convolution producing the three-channel deformation parameters).]

Fig. 3. Graphical representation of the three different models explored in this paper. From left to right: the U-Net, the FCN, and the black-box model used to produce deformation parameters for TETRIS. Building blocks are described in Figure 12.

This important difference means we no longer need to approximate our actual segmentation function via an intensity-based surrogate, and can directly optimise for the task at hand.

We introduce a novel template deformation model that exploits the power of neural network-based registration. Our end-to-end model takes a shape prior in the form of a partial volume image (PVI) and an image as input to a neural network, which learns to deform the input prior so as to produce an accurate segmentation of the input image. This is done by implicitly estimating a deformation field so as to maximise template alignment, corresponding to optimal segmentation accuracy. We provide a detailed description of the main steps below. An overview of our method is shown in Figure 2. In the following subsections (II-B, II-C and II-D) we describe in detail how we perform deformations, how we regularise our deformation field and how we handle large volume sizes.

A. Obtaining Shape Templates

Shape priors can be utilised in neural networks in various forms, such as level sets, PVIs, binary masks or shape parameters (e.g., mesh control points). In this work we focus on the use of a deformation model, conditioned on a shape prior, to deform a PVI into another PVI. Our shape prior itself in this particular case is also a PVI, but we emphasize this is not a necessity, and richer priors such as statistical appearance models can also be used. As template transformer networks predict a transformation instead of a point-wise segmentation map, they lend themselves naturally to the use of other geometric representations for the priors, such as mesh-based models. Shape priors can generally be obtained via manual, semi-automatic and automatic methods, and the exact mechanism is application specific. We later discuss one particular approach for obtaining shape priors for the application of coronary artery segmentation.

B. Deformation Model

To deform a template, consider some input image I, shape prior U, and ground truth segmentation T, all of size H × W × D. We have a sampling scheme (or deformation model) Tθ(G), where G is a standard co-ordinate grid, and a loss function Lη. Tθ(G) is a function on (x_i^t, y_i^t, z_i^t), grid co-ordinates in our target space, that maps to co-ordinates (x_i^s, y_i^s, z_i^s) in our original source space, where we index voxel locations by i ∈ [1, …, H′W′D′] for notational simplicity. Given this, we can define V, a re-sampling of the prior U, based on Tθ as

$$V_i = \sum_n^H \sum_m^W \sum_l^D U_{mnl}\, k(x_i^s - m; \Phi_x)\, k(y_i^s - n; \Phi_y)\, k(z_i^s - l; \Phi_z) \quad \forall i \in [1, \dots, H'W'D'] \qquad (1)$$

where k is any sampling kernel with parameters Φ. For image interpolation we use a trilinear kernel to prevent re-sampled pixel values from being extrapolated outside of the original intensity domain, that is

$$V_i = \sum_n^H \sum_m^W \sum_l^D U_{mnl}\, \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|)\, \max(0, 1 - |z_i^s - l|) \qquad (2)$$
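As a concrete rendering of Eq. (2), the NumPy sketch below (our illustration) exploits the fact that the max(0, 1 − |·|) kernel assigns non-zero weight only to the eight integer neighbours of each source point; axis conventions follow the summation bounds above.

```python
import numpy as np

def trilinear_sample(U, xs, ys, zs):
    """Eq. (2): resample volume U at continuous source coordinates.
    U is indexed U[n, m, l] with n over H, m over W, l over D;
    xs, ys, zs are (N,) arrays of source coordinates in voxel units."""
    H, W, D = U.shape
    V = np.zeros(xs.shape)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    z0 = np.floor(zs).astype(int)
    for dm in (0, 1):            # only the 8 integer neighbours get non-zero weight
        for dn in (0, 1):
            for dl in (0, 1):
                m, n, l = x0 + dm, y0 + dn, z0 + dl
                w = (np.maximum(0, 1 - np.abs(xs - m))
                     * np.maximum(0, 1 - np.abs(ys - n))
                     * np.maximum(0, 1 - np.abs(zs - l)))
                ok = (n >= 0) & (n < H) & (m >= 0) & (m < W) & (l >= 0) & (l < D)
                V[ok] += w[ok] * U[n[ok], m[ok], l[ok]]
    return V
```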


We choose Tθ to be a free-form deformation, i.e. θ is a three-dimensional vector field. For notational simplicity we define the sampling grid function as

$$\begin{pmatrix} x_i^s \\ y_i^s \\ z_i^s \end{pmatrix} = T_\theta(G_i) = \begin{pmatrix} x_i^t \\ y_i^t \\ z_i^t \end{pmatrix} - \begin{pmatrix} \theta_i^x \\ \theta_i^y \\ \theta_i^z \end{pmatrix} \qquad (3)$$

If Tθ is a free-form deformation which is not at the same resolution as the target image, we are required to re-sample the deformation field, potentially using a different set of sampling kernels k with its own parameters Φ. We choose to use B-spline interpolation to ensure smooth fields [40], utilising the Catmull-Rom solution to the interpolation problem [41].
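Concretely, the coarse control-point field can be brought to voxel resolution with a spline interpolator. The sketch below uses SciPy's cubic-spline zoom as a stand-in for the Catmull-Rom B-spline solution of [41], so it is an approximation for illustration only.

```python
import numpy as np
from scipy.ndimage import zoom

def upsample_field(theta, full_shape):
    """Upsample a (3, h, w, d) control-point displacement field to the
    (H, W, D) voxel grid with cubic-spline interpolation (order=3)."""
    factors = [fs / cs for fs, cs in zip(full_shape, theta.shape[1:])]
    return np.stack([zoom(theta[c], factors, order=3) for c in range(3)])
```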

Our method takes inspiration from STNs by using a neural network fψ(I, U), which is conditioned on both the input image I and the shape prior U, to produce the parameters θ of the B-spline deformation model Tθ. We can then perform a deformation of the prior U, calculate a segmentation loss and update the parameters ψ of our network.

By combining template deformation with neural networks, we mitigate the key problem with traditional template deformation models: the need to hand-craft a good image-to-segmentation alignment function. The source of this problem, as with any registration technique, lies in the fact that a loss calculation must be made at test time to update the deformation field parameters θ. By utilising STNs to produce θ at test time and instead updating a neural network fψ during training, we can train a registration model with the true segmentation loss function (based on alignment between prior and reference segmentation), avoiding the need for surrogate functions at test time.

The template deformation model is network agnostic, so any neural network can be used. We choose a simple feed-forward network architecture with convolutions and max pooling to produce a deformation field, which we use in the STN to deform the prior. Full details are provided in Figures 3 and 12.

Our method is able to take any shape prior and deform it with sub-pixel accuracy, unlike other neural network based segmentation algorithms, which typically treat segmentation as pixel-wise classification. Since our model smoothly deforms a prior, we are able to produce partial volume segmentations, reducing discretisation artefacts in the final segmentation maps. We provide experiments on both partial volume data as well as voxel-wise classification results.
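Putting these pieces together, a single training step might look like the PyTorch sketch below. This is a minimal illustration under several assumptions: f_psi is any network mapping the concatenated image and prior to a dense displacement field in the normalised coordinates expected by grid_sample (the control-point prediction and B-spline upsampling step is elided), and the l2 field penalty of the next subsection stands in for the regulariser.

```python
import torch
import torch.nn.functional as F

def tetris_step(f_psi, I, U, target, optimiser, reg_weight=5e-6):
    # f_psi maps the (image, prior) pair to a dense displacement field theta
    # of shape (N, 3, D, H, W), in the normalised [-1, 1] units of grid_sample.
    theta = f_psi(torch.cat([I, U], dim=1))
    identity = torch.eye(3, 4, device=I.device).unsqueeze(0).expand(I.size(0), -1, -1)
    base = F.affine_grid(identity, I.size(), align_corners=False)   # grid G
    grid = base - theta.permute(0, 2, 3, 4, 1)                      # T_theta(G) = G - theta
    V = F.grid_sample(U, grid, align_corners=False)                 # deformed prior
    loss = (F.binary_cross_entropy(V.clamp(0, 1), target)
            + reg_weight * (theta ** 2).mean())                     # l2 field penalty
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```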

C. Field Regularisation

Due to the ill-posed nature of registration problems, it is common to constrain deformation fields by adding a regularisation term to the optimisation problem that favours some desired property, such as locally smooth deformations, or an l2 penalty on the vector field itself to favour minimum-displacement solutions. We investigate two regularisation terms,

$$L_{l_2} := \frac{1}{V} \int_0^X \int_0^Y \int_0^Z T(x, y, z)^2 \, dx\, dy\, dz \qquad (4)$$

and

$$L_{smooth} := \frac{1}{V} \int_0^X \int_0^Y \int_0^Z \left(\frac{\partial^2 T}{\partial x^2}\right)^2 + \left(\frac{\partial^2 T}{\partial y^2}\right)^2 + \left(\frac{\partial^2 T}{\partial z^2}\right)^2 + 2\left(\frac{\partial^2 T}{\partial z\,\partial x}\right)^2 + 2\left(\frac{\partial^2 T}{\partial x\,\partial y}\right)^2 + 2\left(\frac{\partial^2 T}{\partial y\,\partial z}\right)^2 dx\, dy\, dz \qquad (5)$$

L_{l2} penalises the l2-norm of the field and L_{smooth} penalises the sum of squared second-order derivatives.
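On a discrete displacement field, both penalties reduce to simple finite-difference expressions. The sketch below is our discretisation (unit voxel spacing, PyTorch) of Eqs. (4) and (5), not the authors' exact implementation.

```python
import torch

def l2_penalty(theta):
    """Discrete analogue of Eq. (4): mean squared displacement.
    theta: (3, X, Y, Z) displacement field."""
    return (theta ** 2).mean()

def bending_energy(theta):
    """Discrete analogue of Eq. (5) via finite differences."""
    def second(f, a, b):
        return torch.diff(torch.diff(f, dim=a), dim=b)
    e = 0.0
    for a in (1, 2, 3):                      # pure second derivatives
        e = e + (second(theta, a, a) ** 2).mean()
    for a, b in ((3, 1), (1, 2), (2, 3)):    # mixed terms, each counted twice
        e = e + 2 * (second(theta, a, b) ** 2).mean()
    return e
```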

D. Field Aggregation

To deal with the size of the data and the memory restrictions of modern graphics processing units, we perform inference on a patch basis; we collect control points across patches and aggregate them before re-sampling using B-spline interpolation. This, combined with only using valid padding, prevents ill-posed boundary conditions across the image. This also allows us to perform inference on variable-size volumes with consistent control point spacing without modification to the neural network.
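A sketch of this aggregation, under the assumption that non-overlapping patches each contribute a tile of a single global control-point grid; predict_patch is a hypothetical callable standing in for the network forward pass on one patch.

```python
import itertools
import numpy as np

def aggregate_control_points(volume_shape, patch_size, spacing, predict_patch):
    """Collect per-patch control-point displacements into one global grid,
    to be resampled once with B-spline interpolation (not shown here)."""
    grid_shape = tuple(s // spacing for s in volume_shape)
    theta = np.zeros((3,) + grid_shape)
    origins = itertools.product(*[range(0, s - p + 1, p)
                                  for s, p in zip(volume_shape, patch_size)])
    for origin in origins:
        cp = predict_patch(origin)                    # (3, h, w, d) control points
        sl = tuple(slice(o // spacing, o // spacing + c)
                   for o, c in zip(origin, cp.shape[1:]))
        theta[(slice(None),) + sl] = cp
    return theta
```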

III. ILLUSTRATIVE EXAMPLE

As a proof of concept, we present qualitative results on the effects of corruption in the data, as these are not easily quantifiable. To investigate how incorporating a prior into a neural network can help when corruption is present, we create a toy dataset of 1500 randomly deformed P's, B's and R's for training and two hand-crafted test images, for which we provide qualitative results. We then train a deformation model to deform the prior (the letter that was originally deformed) to match the deformed letter. Additionally, we train a normal convolutional neural network to predict the deformed letter on a pixel-wise level. Both the TETRIS model and the convolutional neural network are conditioned on the prior and the target. As is expected, when the image signal is strong, the network learns to rely heavily on the image signal and ignores the prior; this can be seen in Figure 4. We trained both models on only uncorrupted deformations to see how each model handles an out-of-sample test case. We see that the vanilla CNN learns to completely ignore prior information, so when inferring on corrupt data it is not able to extrapolate, unlike the TETRIS model. By restricting our model's output space to deformations of the prior, we are able, even in the presence of corruption, to produce plausible results consistent with our prior.

IV. SYNTHETIC EXPERIMENTS

We argue that for a CNN to handle such corruption, it would need to be present in the training set; Figure 5 shows the effects of having an increased amount of corrupted data in the training set. We construct a secondary dataset, where corruption is more easily generated, consisting of 1000 randomly deformed discs, where corruption takes the form of smaller discs being cut from the main central disc and smaller peripheral discs being placed around it; the set is split in half for training and testing.


Fig. 4. Examples of where a deformation model can extrapolate well outside of the distribution of the training data compared to a standard convolutional neural network. (Rows: corrupt inputs, prior used, CNN.)

TABLE I
QUANTITATIVE SYNTHETIC DATA RESULTS

Corruption in   Dice score        Connected components   Hausdorff distance
training set    CNN     TETRIS    CNN      TETRIS        CNN      TETRIS
0%              0.9409  0.9441    1.80     1.02          25.56    7.60
5%              0.9720  0.9256    4.80     1.00          22.10    7.47
10%             0.9775  0.9268    3.98     1.00          23.25    6.84
15%             0.9710  0.9537    2.11     1.00          20.41    5.80

We trained the models with 0%, 5%, 10% and 15% of the training set consisting of corrupted examples. As expected, the more corruption is present in the training set, the better the standard CNN model is able to handle it at test time. However, the artefacts that occur are not as topologically plausible as those produced by our TETRIS model, which is reflected in the CNN's high Dice scores but also its high Hausdorff distances.

Fig. 5. Qualitative results from varying the amount of corrupted training examples in the dataset. From left to right, the models are trained with 0%, 5%, 10% and 15% of the training set consisting of corrupted examples. (Rows: corrupt input, target, CNN, prior.)

V. CORONARY ARTERY SEGMENTATION EXPERIMENTS

In this paper we focus on the application of vessel segmentation, where ambiguities arise from the functional distinction between veins and arteries, which may have similar image features. This has led some methods to approach the problem as a multi-stage process: first, centerlines are extracted [42], then the vessels are segmented [43].

Fig. 6. An example of the shape prior ((a), tubed prior) and the manual segmentation ((b), target segmentation). The shape prior is a tubed human-annotated centreline with a fixed one-millimetre radius.

Shape priors can be enforced once good candidate centerlines have been extracted by treating the segmentation task as a wall distance estimation task. By utilising curved planar reformation [44], the segmentation problem can be cast as a wall distance regression from the centerline, and topology can be guaranteed.

We train our network on a set of 274 annotated cardiac CT volumes with 0.5 millimetre isotropic spacing, reserving 138 volumes for validation and an additional 136 for testing. The ground truth labels, obtained through manual expert segmentation, are in the form of partial volumes.

A. Generating Priors for Coronary Arteries

To generate a shape prior for coronary artery segmentation, we first extract a centerline using a semi-automatic method which consists of a Random Forest voxel-wise classification, a Dijkstra's shortest path based tree extraction, and finally a human review and correction step to correct outliers. The centerline is converted to a 3D volumetric representation by creating a tube around it with a fixed radius of 1 mm in a partial volume image, since the centerline exists in arbitrary space rather than voxel space. More examples of coronary centerline extraction methods in cardiac CTA can be found in [42]. Figure 6 is a volume rendering illustrating the difference between the prior and the ground truth segmentation for an example vessel. More generally, priors can be extracted from sources such as automated algorithms, weak labels, human expert knowledge or population based statistics, and will inherently be application specific.
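One plausible way to rasterise such a tube is via a Euclidean distance transform from the (voxelised) centerline, soft-thresholded at the tube radius. This is our illustrative sketch, not the authors' exact construction; the paper notes the centerline lives in millimetre space, which is coarsened to voxel seeds here.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def tube_prior(centerline_mm, shape, spacing=0.5, radius_mm=1.0):
    """Rasterise a fixed-radius tube around a centerline as a soft prior."""
    seeds = np.ones(shape, dtype=bool)
    idx = np.round(np.asarray(centerline_mm) / spacing).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < np.array(shape)), axis=1)]
    seeds[tuple(idx.T)] = False                              # zeros on the centerline
    dist = distance_transform_edt(seeds, sampling=spacing)   # mm to centerline
    # Soft threshold over one voxel width approximates a partial volume image.
    return np.clip((radius_mm + spacing / 2 - dist) / spacing, 0.0, 1.0)
```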

B. Model Details and Baselines

We use two baseline models against which to compare the models we present: i) a residual fully convolutional network (FCN) and ii) a residual U-Net architecture, utilising the implementations from [45] with residual blocks from [46]. Details of the architectures can be found in Figure 3, where the building blocks are described in Figure 12.

We also present results on naively incorporating shape priors into these state-of-the-art models. We do this by feeding the networks two channels of input: the image to be segmented and the prior that we have of the image at that location.


This alternative method is a very simple extension of existing state-of-the-art pixel-wise approaches, computationally cheap and easy to implement. The shape prior, in this case, acts as a kind of initialization for the network's output.

C. Training Details

For all models we use the same patch extraction parameters: during training we dynamically extract 32 patches from each volume and randomly shuffle them into a buffer of 512 patches. Patches are extracted if they are near the centreline, biasing the sampling around the vessel. We use a batch size of 8 for all models and train them using the Adam optimiser [47] while exponentially decaying the learning rate. The learning rate at step i is defined as

$$l_i = l_0 \cdot r^{i/s} \qquad (6)$$

where our initial learning rate is l0 = 1 · 10^-5, the decay rate is r = 0.99, and the decay step is s = 500; where regularization is used, we weight it by 5 · 10^-6.
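In code, the schedule of Eq. (6) with these constants is simply:

```python
def learning_rate(i, l0=1e-5, r=0.99, s=500):
    """Exponentially decayed learning rate of Eq. (6): l_i = l0 * r**(i / s)."""
    return l0 * r ** (i / s)
```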

We pretrain our baseline models using a weighted cross entropy function as defined in Equation 7, where p is our target distribution, q is our candidate distribution and w is a weighting factor. By setting w > 1, we bias the loss term to penalise false negatives. This is beneficial, as voxels containing the vessel interior are sparse in any given patch. Penalizing false negatives more prevents the network from predicting all voxels as background during the initial stages of optimization, a trivial local optimum. Note this does not need to be done with our TETRIS model, as the network is already biased towards the identity transform thanks to our regularisation term, which favours a smooth deformation field. For our experiments we set w = 2 and pretrain our non-TETRIS models for 1000 iterations.

$$-w \cdot p \log(q) - (1 - p) \log(1 - q) \qquad (7)$$

We fine-tune the models using normal, un-weighted cross entropy for a further 5000 iterations.
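A direct PyTorch rendering of the weighted cross entropy of Eq. (7) (our sketch; the clamp guards the logarithms):

```python
import torch

def weighted_bce(p, q, w=2.0, eps=1e-7):
    """Eq. (7): -w * p * log(q) - (1 - p) * log(1 - q), averaged over voxels.
    p: target partial volumes in [0, 1]; q: predictions; w > 1 penalises
    false negatives, as used for pretraining the non-TETRIS models."""
    q = q.clamp(eps, 1 - eps)
    return (-w * p * torch.log(q) - (1 - p) * torch.log(1 - q)).mean()
```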

D. Results

Results are presented on a test set of 136 cases. For the task of partial volume estimation we use the cross entropy as a measure of accuracy; we can see from Table II that incorporating shape priors into state-of-the-art neural network segmentation models significantly improves results. For comparison we include results using the identity function on the prior, i.e. naively taking the shape prior as the segmentation. We provide box plots of the results in Figure 9 for a more fine-grained breakdown, where we have plotted the cross entropy on a log scale. Though our model with l2 field regularisation performs the best, there is no significant difference between the methods, exhibiting the expressiveness of a deformation model despite constraining the output space to be a deformation of the prior.

Fig. 7. Close-up of qualitative results shown as contours for the different methods, where the blue, green, red and cyan contours are the target segmentation, TETRIS, FCN (with prior) and U-Net (with prior) respectively. Left and right: orthogonal views of the left anterior descending artery, near the first diagonal bifurcation, where TETRIS outperforms the other methods.

TABLE II
QUANTITATIVE SEGMENTATION RESULTS ON TEST CASES

                   Cross     Connected    Dice    Hausdorff   Trainable
                   Entropy   Components   Score   Distance    Parameters
U-Net              0.01219   26.4         0.336   73.41       11.94M
FCN                0.01190   27.0         0.406   59.94       13.74M
U-Net (w/ prior)   0.00186   1.0          0.854   2.86        11.95M
FCN (w/ prior)     0.00163   1.1          0.790   2.87        13.75M
TETRIS-no-reg      0.00162   1.0          0.779   3.20        1.38M
TETRIS-l2          0.00160   1.0          0.787   3.36        1.38M
TETRIS-smooth      0.00163   1.0          0.768   3.55        1.38M

Both our U-Net (with prior) model and our proposed TETRIS model are able to consistently produce singly connected components without post-processing by incorporating prior information, further demonstrating the potential of our approach. However, the FCN (with prior) model often does not capture these higher-order requirements, even with prior information; we believe this is due to the inherent multi-scale nature of the U-Net architecture. TETRIS shines when using metrics that take into account partial volumes; however, our CNNs with priors added as input channels work consistently well when the goal is pixel-wise binary segmentation.


Fig. 8. Close-up of qualitative results shown as contours for the different methods, where the blue, green, red and cyan contours are the target segmentation, TETRIS, FCN (with prior) and U-Net (with prior) respectively. Left and right: orthogonal views of a trifurcation where TETRIS over-segments and the FCN and U-Net models under-segment.

We also note a drastic reduction in trainable parameters, by a factor of ten, for TETRIS compared to U-Net and FCN, indicating a better balance between performance and model complexity.

To obtain the number of connected components, we threshold the partial volume segmentations at 0.5 and perform a 26-connected component analysis. Ideally, all segmentations should have only one connected component. We note that TETRIS without regularisation may result in discontinuous segmentations, but we did not find this to be the case in practice. We notice no major difference between penalising the field with an l2 penalty or with the sum of second-order derivatives.
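This analysis corresponds to the following SciPy sketch, where a full 3 × 3 × 3 structuring element gives 26-connectivity:

```python
import numpy as np
from scipy.ndimage import label

def count_components(pvi, threshold=0.5):
    """Threshold a partial-volume segmentation and count 26-connected components."""
    structure = np.ones((3, 3, 3), dtype=int)   # full neighbourhood = 26-connectivity
    _, n_components = label(pvi > threshold, structure=structure)
    return n_components
```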

Figure 10 shows an example case where neural networks are not able to recover the vessel in the image without prior information, whereas all three of our models are able to fall back on the prior when the image signal is weak.

We further investigated the use of more complex models for TETRIS but found that convergence became slow and often resulted in similar validation scores, hence our choice of a simple TETRIS model. We also notice that the models have different strengths and weaknesses.

Fig. 9. Cross entropy (log scale) for partial volume estimation on test cases for the different methods investigated, clearly demonstrating the benefits of incorporating prior information and the ability of a deformation model to perform just as well as, if not better than, a standard neural network which naively incorporates prior information.

Fig. 10. Close-up of qualitative results shown as volume renderings for the different methods with and without shape priors, compared to the manual target segmentation: (a) target segmentation, (b) TETRIS-l2, (c) FCN (w/ prior), (d) U-Net (w/ prior), (e) FCN, (f) U-Net.

As mentioned previously, the U-Net (with prior) model is better than the FCN (with prior) model at capturing global consistency of shape, but in regions where contrast is low, our TETRIS model produces smoother, more accurate segmentations, as seen in Figure 7.

Using a deformation model does have caveats: in Figure 8 we see a trifurcation region where TETRIS over-segments and both the U-Net and FCN with prior under-segment. The resolution of the field and the penalty applied to large deformations prevent our model from doing well in such regions.



In summary, we should highlight the advantages of the template transformer based networks over point-wise segmentation models such as U-Net and FCN, as they might not be apparent from the segmentation scores. Although U-Net performs best on Dice, TETRIS performs better on cross entropy, assessing the agreement of the soft, partial volume predictions. Additionally, template transformer networks can provide guarantees on the resulting shape, while both U-Net and FCN do not. This can be important in applications where the segmentations are used for downstream tasks such as shape analysis or blood flow calculation. One other important benefit of the TETRIS model (although not explored in this work) is the ability to incorporate a variety of shape priors, such as mesh-based representations or probabilistic shape and appearance models (e.g. a mean and variance image).

VI. DISCUSSION

We introduced Template Transformer Networks for image segmentation, which are able to deform shape priors into segmentations. This work builds on template deformations by no longer requiring hand-crafted image-to-segmentation cost functions, and makes use of spatial transformer networks for differentiable end-to-end learning. Our method is competitive with state-of-the-art segmentation algorithms while being able to guarantee topological constraints.

Our work is a proof of concept which relied on a simple architecture that can easily be extended. Though our model is restricted in the sense that it can only perform deformations of a prior, we argue this can be an advantage where shape guarantees are important. Arguably, our model strikes a better balance between performance and model complexity due to its significantly smaller number of trainable parameters.

We consider the prior extraction beyond the scope of this work, but we believe that it is a critical part of not only our method, but of all methods which require a prior. In problems where no sensible priors can be used, template based methods are likely not suitable. Though not explored in this work, our approach lends itself to the incorporation of much richer priors, such as probabilistic shape priors and other geometric representations such as meshes or point distribution models.

Our method replaces an iterative method with a one-shot method; we believe a natural extension would be to incorporate template deformations with recurrent or auto-regressive neural networks for more flexible and potentially larger deformations, mitigating the effects of the chosen resolution for the control point grid. Though in this work we chose to use B-splines, our method is agnostic to the choice of parameterisation of the deformation field. The exploration of other and potentially more flexible parameterisations is also of great interest. Additionally, we would like to explore the use of deformation fields which are not on fixed grids, so as to allow for finer deformation fields as and when needed, without the burden of excess computation.

ACKNOWLEDGEMENTS

The authors would like to thank Sarah Parisot, Katarína Tothova, Ozan Oktay and Sofy Weisenberg for insightful feedback on the manuscript, as well as Vishwanath Sangale, David Lesage, James Batten and Karl Hahn for fruitful discussions.

REFERENCES

[1] M. S. Nosrati and G. Hamarneh, "Incorporating prior knowledge in medical image segmentation: a survey," CoRR, vol. abs/1607.01092, 2016.

[2] A. L. Yuille, D. S. Cohen, and P. W. Hallinan, "Feature extraction from faces using deformable templates," in Proceedings CVPR '89: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1989, pp. 104–109.

[3] G. Subsol, J.-P. Thirion, and N. Ayache, "A scheme for automatically building three-dimensional morphometric anatomical atlases: application to a skull atlas," Medical Image Analysis, vol. 2, no. 1, pp. 37–60, 1998.

[4] T. Kapur, P. Beardsley, S. Gibson, W. Grimson, and W. Wells, "Model-based segmentation of clinical knee MRI," in Proc. IEEE Intl Workshop on Model-Based 3D Image Analysis, 1998, pp. 97–106.

[5] D. Perperidis, M. Lorenzo-Valdes, R. Chandrashekara, A. Rao, R. Mohiaddin, G. Sanchez-Ortiz, and D. Rueckert, "Building a 4D atlas of the cardiac anatomy and motion using MR imaging," in Biomedical Imaging: Nano to Macro, 2004. IEEE International Symposium on. IEEE, 2004, pp. 412–415.

[6] T. Heimann and H.-P. Meinzer, "Statistical shape models for 3D medical image segmentation: a review," Medical Image Analysis, vol. 13, no. 4, pp. 543–563, 2009.

[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI, May 2015, pp. 234–241.

[8] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.

[9] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," CoRR, vol. abs/1606.00915, 2016.

[10] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, "Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation," Medical Image Analysis, vol. 36, pp. 61–78, 2017.

[11] F. Milletari, N. Navab, and S. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," CoRR, vol. abs/1606.04797, 2016.

[12] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, "Conditional random fields as recurrent neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.

[13] F. Milletari, A. Rothberg, J. Jia, and M. Sofka, "Integrating statistical prior knowledge into convolutional neural networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 161–168.

[14] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. de Marvao, T. Dawes, D. P. O'Regan, B. Kainz, B. Glocker, and D. Rueckert, "Anatomically constrained neural networks (ACNNs): Application to cardiac image enhancement and segmentation," IEEE Transactions on Medical Imaging, vol. 37, no. 2, pp. 384–395, Feb 2018.

[15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.

[16] J. E. Iglesias and M. R. Sabuncu, "Multi-atlas segmentation of biomedical images: a survey," Medical Image Analysis, vol. 24, no. 1, pp. 205–219, 2015.

[17] E. D'Agostino, F. Maes, D. Vandermeulen, and P. Suetens, "An information theoretic approach for non-rigid image registration using voxel class probabilities," Medical Image Analysis, vol. 10, no. 3, pp. 413–431, 2006.


[18] W. Bai, W. Shi, D. P. O'Regan, T. Tong, H. Wang, S. Jamil-Copley, N. S. Peters, and D. Rueckert, "A probabilistic patch-based label fusion model for multi-atlas segmentation with registration refinement: application to cardiac MR images," IEEE Transactions on Medical Imaging, vol. 32, no. 7, pp. 1302–1315, 2013.

[19] K. A. Saddi, C. Chefd'Hotel, M. Rousson, and F. Cheriet, "Region-based segmentation via non-rigid template matching," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–7.

[20] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active shape models—their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.

[21] B. Flury, Multivariate Statistics: A Practical Approach. Chapman & Hall, Ltd., 1988.

[22] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 681–685, 2001.

[23] H. Uzunova, M. Wilms, H. Handels, and J. Ehrhardt, "Training CNNs for image registration from few samples with model-based data augmentation," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2017, M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne, Eds. Cham: Springer International Publishing, 2017, pp. 223–231.

[24] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.

[25] H. Sokooti, B. de Vos, F. Berendsen, B. P. F. Lelieveldt, I. Isgum, and M. Staring, "Nonrigid image registration using multi-scale 3D convolutional neural networks," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2017, M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne, Eds. Cham: Springer International Publishing, 2017, pp. 232–239.

[26] X. Cao, J. Yang, J. Zhang, D. Nie, M. Kim, Q. Wang, and D. Shen, "Deformable image registration based on similarity-steered CNN regression," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2017, M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne, Eds. Cham: Springer International Publishing, 2017, pp. 300–308.

[27] M.-M. Rohe, M. Datar, T. Heimann, M. Sermesant, and X. Pennec, "SVF-Net: Learning deformable image registration using shape matching," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2017, M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne, Eds. Cham: Springer International Publishing, 2017, pp. 266–274.

[28] X. Yang, R. Kwitt, M. Styner, and M. Niethammer, "Quicksilver: Fast predictive image registration – a deep learning approach," NeuroImage, vol. 158, pp. 378–396, 2017.

[29] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, no. 2, pp. 139–157, 2005.

[30] K. Ma, J. Wang, V. Singh, B. Tamersoy, Y.-J. Chang, A. Wimmer, and T. Chen, "Multimodal image registration with deep context reinforcement learning," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 240–248.

[31] B. D. de Vos, F. F. Berendsen, M. A. Viergever, M. Staring, and I. Isgum, "End-to-end unsupervised deformable image registration with a convolutional neural network," CoRR, vol. abs/1704.06065, 2017.

[32] S. Shan, X. Guo, W. Yan, E. I. Chang, Y. Fan, and Y. Xu, "Unsupervised end-to-end learning for deformable medical image registration," CoRR, vol. abs/1711.08608, 2017.

[33] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, "An unsupervised learning model for deformable medical image registration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9252–9260.

[34] H. Zhang and X. He, "Deep free-form deformation network for object-mask registration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4251–4259.

[35] C. Qin, W. Bai, J. Schlemper, S. E. Petersen, S. K. Piechnik, S. Neubauer, and D. Rueckert, "Joint learning of motion estimation and segmentation for cardiac MR image sequences," arXiv preprint arXiv:1806.04066, 2018.

[36] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic segmentation using adversarial networks," arXiv preprint arXiv:1611.08408, 2016.

[37] M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz et al., "DeepCut: Object segmentation from bounding box annotations using convolutional neural networks," IEEE Transactions on Medical Imaging, vol. 36, no. 2, pp. 674–683, 2017.

[38] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," in ACM Transactions on Graphics (TOG), vol. 23, no. 3. ACM, 2004, pp. 309–314.

[39] K. Tothova, S. Parisot, M. C. Lee, E. Puyol-Anton, L. M. Koch, A. P. King, E. Konukoglu, and M. Pollefeys, "Uncertainty quantification in CNN-based surface prediction using shape priors," arXiv preprint arXiv:1807.11272, 2018.

[40] D. Rueckert, P. Aljabar, R. A. Heckemann, J. V. Hajnal, and A. Ham-mers, “Diffeomorphic registration using B-splines.” Medical imagecomputing and computer-assisted intervention : MICCAI ... Interna-tional Conference on Medical Image Computing and Computer-AssistedIntervention, vol. 9, no. Pt 2, pp. 702–9, 2006.

[41] E. Catmull and R. Rom, “A class of local interpolating splines,” inComputer aided geometric design. Elsevier, 1974, pp. 317–326.

[42] M. Schaap, C. T. Metz, T. van Walsum, A. G. van der Giessen, A. C.Weustink, N. R. Mollet, C. Bauer, H. Bogunovic, C. Castro, X. Denget al., “Standardized evaluation methodology and reference databasefor evaluating coronary artery centerline extraction algorithms,” Medicalimage analysis, vol. 13, no. 5, pp. 701–714, 2009.

[43] Kirisli, H. A. and Schaap, M. and Metz, C. T. and Dharampal, A. S. andMeijboom, W. B. and Papadopoulou, S. L. and Dedic, A. and Nieman,K. and De Graaf, M. A. and Meijs, M. F. L. and others, “Standardizedevaluation framework for evaluating coronary artery stenosis detection,stenosis quantification and lumen segmentation algorithms in computedtomography angiography,” Medical image analysis, vol. 17, no. 8, pp.859–876, 2013.

[44] A. Kanitsar, D. Fleischmann, R. Wegenkittl, P. Felkel, and M. E. Groller,“Cpr: curved planar reformation,” in Proceedings of the conference onVisualization’02. IEEE Computer Society, 2002, pp. 37–44.

[45] N. Pawlowski, S. I. Ktena, M. C. Lee, B. Kainz, D. Rueckert, B. Glocker,and M. Rajchl, “Dltk: State of the art reference implementations for deeplearning on medical images,” arXiv preprint arXiv:1711.06853, 2017.

[46] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” CoRR, vol. abs/1512.03385, 2015.

[47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014.

APPENDIX

We provide full examples of vessel segmentations in Figure 11 to give the reader larger context on the accuracy of the model; the aorta has been trimmed for clearer visualisation. Without prior information, the sensitivity of the networks drops substantially.

Fig. 11. Qualitative results shown as volume rendering for the different methods with and without shape priors, compared to the manual target segmentation: (a) Target Segmentation, (b) TETRIS-l2, (c) FCN (w/ prior), (d) U-Net (w/ prior), (e) FCN, (f) U-Net. The transfer function from partial volume probability to opacity is set as the identity function from [0, 1] → [0, 1].
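To make the caption concrete, a minimal sketch of the opacity mapping used for these renderings; the function name and NumPy usage are ours, not from the paper.

```python
import numpy as np

def opacity_transfer(partial_volume):
    # Identity transfer function from [0, 1] to [0, 1]: the soft
    # partial-volume probability is used directly as rendering opacity,
    # so sub-voxel boundary voxels appear semi-transparent.
    return np.clip(np.asarray(partial_volume, dtype=np.float32), 0.0, 1.0)
```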

[Fig. 12 diagram: three building blocks — residual unit(n, k, s), bottleneck residual unit(n, k, s), and residual matching unit(n). Each stacks Batch Normalisation, an activation function, and n convolution kernels of size k³ with stride s³ or 1³; the skip path of the residual units is downsampled by max pooling with kernel size s³ before the additive merge, the bottleneck variant adds a 1³ convolution, and the residual matching unit merges its two inputs via 1³ convolutions and tri-linear upsampling.]

Fig. 12. Graphical representation of the building blocks used to construct all models.
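For readers who prefer code to diagrams, the following is a minimal PyTorch sketch of the residual unit and residual matching unit as we read them from Fig. 12; the class names, channel-matching projection, ReLU choice, and padding scheme are our assumptions, not taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualUnit(nn.Module):
    # Pre-activation residual unit as read from Fig. 12:
    # BN -> activation -> n conv kernels of size k^3 with stride s^3,
    # then BN -> activation -> n conv kernels of size k^3 with stride 1^3;
    # the skip path is max-pooled with kernel size s^3 before the add.
    def __init__(self, in_ch, n, k, s):
        super().__init__()
        self.bn1 = nn.BatchNorm3d(in_ch)
        self.conv1 = nn.Conv3d(in_ch, n, kernel_size=k, stride=s, padding=k // 2)
        self.bn2 = nn.BatchNorm3d(n)
        self.conv2 = nn.Conv3d(n, n, kernel_size=k, stride=1, padding=k // 2)
        self.act = nn.ReLU()
        # ceil_mode keeps the pooled skip path the same spatial size as the
        # strided convolution output, also for odd input sizes (k odd).
        self.pool = nn.MaxPool3d(kernel_size=s, ceil_mode=True) if s > 1 else nn.Identity()
        # 1^3 projection so channel counts match at the add (our assumption).
        self.proj = nn.Conv3d(in_ch, n, kernel_size=1) if in_ch != n else nn.Identity()

    def forward(self, x):
        out = self.conv1(self.act(self.bn1(x)))
        out = self.conv2(self.act(self.bn2(out)))
        return out + self.proj(self.pool(x))


class ResidualMatchingUnit(nn.Module):
    # Decoder-side merge as read from Fig. 12: each input is passed through
    # BN, activation, and n conv kernels of size 1^3; the coarser input is
    # tri-linearly upsampled before the add. The exact ordering of these
    # operations is our reading of the figure.
    def __init__(self, ch_fine, ch_coarse, n):
        super().__init__()
        self.bn_fine = nn.BatchNorm3d(ch_fine)
        self.bn_coarse = nn.BatchNorm3d(ch_coarse)
        self.conv_fine = nn.Conv3d(ch_fine, n, kernel_size=1)
        self.conv_coarse = nn.Conv3d(ch_coarse, n, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, fine, coarse):
        fine = self.conv_fine(self.act(self.bn_fine(fine)))
        coarse = self.conv_coarse(self.act(self.bn_coarse(coarse)))
        coarse = F.interpolate(coarse, size=fine.shape[2:], mode="trilinear",
                               align_corners=False)
        return fine + coarse
```

A unit could then be instantiated as, e.g., `ResidualUnit(in_ch=16, n=32, k=3, s=2)`, halving the spatial resolution while increasing the feature count, with `ResidualMatchingUnit` fusing encoder and decoder features at matched resolution.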