
QUALITY ASSURANCE USING OUTLIER DETECTION FOR AUTOMATIC SEGMENTATION OF CEREBELLAR PEDUNCLES

by

Ke Li

A thesis submitted to Johns Hopkins University in conformity with the requirements for the degree of Master of Science in Engineering

Baltimore, Maryland

August, 2015

© 2015 Ke Li All Rights Reserved


Abstract

Cerebellar peduncles (CPs) are white matter tracts that connect the cerebellum to other brain regions. Automatic segmentation and quantification methods for the CPs are important for studying their structure and function objectively and efficiently. The performance of automatic segmentation methods is usually evaluated by comparison with manual delineations (ground truth). However, while this approach characterizes performance in an average sense, when a segmentation method is run on new data (for which no ground truth exists) it is highly desirable to be able to detect and assess algorithm failures efficiently, so that these cases can be excluded from scientific analysis or rerun with different parameters.

This thesis focuses on better understanding the performance of an automatic CP segmentation method using two kinds of outlier detection methods: a simple univariate non-parametric method using box-whisker plots, and a supervised classification method. The content of this thesis is divided into three parts. First, a new segmentation pipeline and its validation are described. The validation is performed with two statistical tests on two segmentation quality metrics. Results show that segmentation labels from the new pipeline are not statistically different from those of the old pipelines, and that the new pipeline performs even better at segmenting the decussation of the superior cerebellar peduncles (dSCPs).

In the second part of this thesis, the univariate outlier detection method using box-whisker plots is described. Automatic segmentation labels of a dataset with 48 subjects were manually categorized as successful segmentations or segmentation failures. Three kinds of features were extracted from the categorized failures and then used for failure detection, and the performances of these features were quantitatively compared.

In the third part of this thesis, both box-whisker plots and the supervised classification method are applied to two datasets with a total of 249 automatic segmentation labels, each manually categorized as a success or a failure. Four classifiers—linear discriminant analysis (LDA), logistic regression (LR), support vector machine (SVM), and random forest classifier (RFC)—were used for failure detection. Each classifier's performance was evaluated using leave-one-out cross-validation. Results show that the performances of LDA, SVM, and RFC are not very different, while LR performs worse than the other three classifiers.

This thesis was prepared under the direction of Dr. Jerry L. Prince. The other two readers are Dr. Bruno M. Jedynak and Dr. Sarah H. Ying.


Acknowledgements

I would like to express my sincere appreciation to my research advisor, Dr. Jerry L. Prince, for his guidance, consistent encouragement and support, and many useful discussions during this research project. I am especially grateful for the considerable time Dr. Prince spent reviewing and correcting my thesis. I would also like to thank Dr. Bruno M. Jedynak for his suggestions on how to identify outliers easily and Dr. Sarah H. Ying for her helpful comments on revising my thesis. Furthermore, I would like to thank the members of the Image Analysis and Communications Lab for their kind help in this research, particularly Zhen Yang, Dr. Chuyang Ye, Jeff Glaister, Amod Jog, Aaron Carass, and Dr. Min Chen. Lastly, I want to thank my family and friends for their support and encouragement.


Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Thesis Contributions
  1.2 Thesis Organization
Chapter 2 Background
  2.1 Automatic Segmentation Method of Cerebellar Peduncles
  2.2 Quality Assurance in the Medical Imaging Field
  2.3 Outlier Detection Methodologies
Chapter 3 Validation of the New Algorithm Pipeline
  3.1 Algorithm Pipelines
    3.1.1 CATNAP
    3.1.2 The segmentation pipelines
  3.2 Comparison of Segmentation Labels
    3.2.1 Description of the Tomacco dataset
    3.2.2 Comparison results
  3.3 Statistical Tests
    3.3.1 Tests on the Dice coefficients
    3.3.2 Tests on the average surface distances (ASDs)
    3.3.3 Conclusion
Chapter 4 Outlier Detection on the Tomacco Dataset
  4.1 Categorization of Automatic Segmentations
  4.2 Feature Extraction
  4.3 Outlier Detection Results
Chapter 5 Outlier Detection on the Kwyjibo and Tomacco Datasets
  5.1 Categorization of Automatic Segmentations
  5.2 Outlier Detection Results
  5.3 Verification of the CATNAP-v2 Algorithm
  5.4 Reprocessing of the Tomacco Dataset
  5.5 Outlier Detection using Classification Methods
    5.5.1 The four classifiers
    5.5.2 Training sets
    5.5.3 Performance evaluation
Chapter 6 Conclusions and Future Work
  6.1 Main Contributions
  6.2 Future Work
Bibliography
Vita


List of Tables

Table 3.1 The Dice coefficients between the manual delineations and the segmentation labels in the two RFC processes in the RFC+MGDM and CPSeg pipelines, respectively.

Table 3.2 The Dice coefficients between the manual delineations and the final segmentation labels in the two MGDM processes in the RFC+MGDM and CPSeg pipelines, respectively.

Table 3.3 The p-values of the paired Student's t-test and the Wilcoxon signed-rank test for comparing the Dice coefficients between the RFC and RFC+MGDM results and the results from CPSeg.

Table 3.4 The ASDs between the manual delineations and the segmentation labels in the two RFC processes in the RFC+MGDM and CPSeg pipelines, respectively.

Table 3.5 The ASDs between the manual delineations and the final segmentation labels in the two MGDM processes in the RFC+MGDM and CPSeg pipelines, respectively.

Table 3.6 The p-values of the paired Student's t-test and the Wilcoxon signed-rank test for comparing the ASDs between the RFC and RFC+MGDM results and the results from CPSeg.

Table 4.1 Information on the 48 subjects in the Tomacco dataset, including ID, diagnoses, gender, categories, and scores.

Table 4.2 Outlier detection results by selected features of the Tomacco dataset. The top nine subjects were manually categorized as segmentation failures in this dataset.

Table 5.1 Information on the categorized segmentation failures and imperfect but successful segmentations in the Kwyjibo dataset, including ID, diagnoses, gender, categories, and scores.

Table 5.2 Outlier detection results by selected features on the Kwyjibo dataset. The top eight subjects were manually categorized as segmentation failures in this dataset.

Table 5.3 Information on the 46 subjects in the Tomacco dataset, including ID, diagnoses, gender, categories, and scores.

Table 5.4 Outlier detection results by selected features of the reprocessed Tomacco dataset. Four subjects were manually categorized as segmentation failures in this dataset.

Table 5.5 Performance comparison of the four classifiers (LDA, LR, SVM, and RFC) on the combined Tomacco and Kwyjibo dataset.


List of Figures

Figure 1.1 Cerebellar peduncles (SCP, MCP, and ICP) shown with the cerebellum (gray) and the brainstem (purple) (Chuyang Ye, Yang, Ying, & Prince, 2015).

Figure 3.1 The old segmentation pipeline: (a) is the RFC process and (b) is the MGDM process.

Figure 3.2 The new integrated segmentation pipeline: cerebellar peduncle segmentation (CPSeg).

Figure 3.3 Comparison of the three outputs of the two RFC processes in the CPSeg and RFC+MGDM pipelines. (a) The upper row includes (left to right) the segmentation labels, the brain mask, and the membership from the RF WM Initialization module, which is the RFC process in CPSeg. The middle row includes these three results from the old RFC pipeline. The bottom row includes the subtractions of the three results from CPSeg and RFC, respectively. (b) The same images as shown in (a) except that the direct input parameters to the two RF WM Initialization modules in the CPSeg and RFC+MGDM pipelines are the same.

Figure 4.1 (a) A PEV edge map overlaid with a corresponding automatic segmentation (yellow–MCP, dark blue–lSCP, light blue–rSCP, orange–lICP, red–rICP). (b) A linear Westin index overlaid with a corresponding automatic segmentation.

Figure 4.2 Nine segmentation failures in the Tomacco dataset: (a) is a successful segmentation of a subject with ID at1000 as a reference. (b)–(j) are nine segmentation failures of subjects with IDs at1034, at1049, at1083, at1007, at1002, at1016, at1046, at1078, and at1103, respectively.

Figure 4.3 Six imperfect segmentations in the Tomacco dataset: (a)–(f) are image slices of segmentation results from the subjects with IDs at1032, at1056, at1041, at1080, at1081, and at1086, respectively. (a), (b), (d), and (e): a small portion of the MCP is cut off. (c) and (f): a small portion of the lICP is cut off.

Figure 4.4 Volumes of the six CPs of manual delineations of 10 Tomacco datasets (red boxes) and automatic segmentations of the 48 Tomacco datasets (blue boxes), respectively.

Figure 4.5 Surface areas of the six CPs of manual delineations of 10 Tomacco datasets (red boxes) and automatic segmentations of the 48 Tomacco datasets (blue boxes), respectively.

Figure 4.6 Means and SDs of FA and MD of the whole brains of the 48 Tomacco datasets.

Figure 4.7 Means and SDs of the three Westin indices of the whole brains of the 48 Tomacco datasets.

Figure 4.8 Brain mask features: volumes of the left, right, and whole brain masks (the left three boxplots) and the symmetry of the brain masks (the rightmost boxplot) of the 48 Tomacco datasets.

Figure 5.1 Eight segmentation failures in the Kwyjibo dataset: (a) AT1275, scan time 04/02/2007, rICP is missing. (b) AT1532, scan time 12/02/2009, dSCP is missing. (c) AT1569, scan time 10/01/2010, dSCP is missing. (d) AT1594, scan time 11/29/2012, dSCP is missing. (e) AT1061, scan time 08/04/2008, a large portion of the MCP (yellow) is cut off. (f) AT1061, scan time 03/06/2009, a large portion of the MCP (yellow) is cut off. (g) AT1219, scan time 07/30/2010, failure to segment the MCP and SCPs correctly. (h) AT1556, scan time 08/02/2011, rSCP is missing and the MCP and lSCP are not correctly segmented.

Figure 5.2 Volumes of the six CPs of the 48 segmentations in the Tomacco dataset (red boxes) and the 203 segmentations in the Kwyjibo dataset (blue boxes), respectively.

Figure 5.3 Surface areas of the six CPs of the 48 segmentations in the Tomacco dataset (red boxes) and the 203 segmentations in the Kwyjibo dataset (blue boxes), respectively.

Figure 5.4 Means and SDs of FA and MD of the whole brains of the 48 Tomacco datasets (red boxes) and the 203 Kwyjibo datasets (blue boxes), respectively.

Figure 5.5 Means and SDs of the three Westin indices of the whole brains of the 48 Tomacco datasets (red boxes) and the 203 Kwyjibo datasets (blue boxes), respectively.

Figure 5.6 Brain mask features: volumes of the left, right, and whole brain masks and the symmetry of the brain masks of the 48 Tomacco datasets (red boxes) and the 203 Kwyjibo datasets (blue boxes), respectively.

Figure 5.7 Volume differences of the six CPs of 30 Tomacco segmentations using CATNAP-v2 and CATNAP-v1, respectively. The results are obtained by subtracting the volumes of the CPs using CATNAP-v1 from those using CATNAP-v2.

Figure 5.8 Histograms of the volume differences of the six CPs of 30 segmentations using CATNAP-v2 and CATNAP-v1, respectively. The results are obtained by subtracting the volumes of the CPs using CATNAP-v1 from those using CATNAP-v2.

Figure 5.9 (a) The registered segmentation result of at1020 using CATNAP-v1; (b) the segmentation result of at1020 using CATNAP-v2; and (c) their subtraction. (d) The MPRAGE image of at1020 and (e) the subtracted image overlaid with the MPRAGE image.

Figure 5.10 Volumes of the six CPs of the 46 reprocessed Tomacco datasets (red boxes) and the 203 Kwyjibo datasets (blue boxes), respectively.

Figure 5.11 Surface areas of the six CPs of the 46 reprocessed Tomacco datasets (red boxes) and the 203 Kwyjibo datasets (blue boxes), respectively.

Figure 5.12 Means and SDs of FA and MD of the whole brains of the 46 reprocessed Tomacco datasets (red boxes) and the 203 Kwyjibo datasets (blue boxes), respectively.

Figure 5.13 Means and SDs of the three Westin indices of the whole brains of the 46 reprocessed Tomacco datasets (red boxes) and the 203 Kwyjibo datasets (blue boxes), respectively.

Figure 5.14 Brain mask features: volumes of the left, right, and whole brain masks and the symmetry of the brain masks of the 46 reprocessed segmentations in the Tomacco dataset (red boxes) and the 203 segmentations in the Kwyjibo dataset (blue boxes), respectively.


Chapter 1 Introduction

The cerebellar peduncles, which carry the inputs and outputs of the cerebellum, are major white matter tracts connecting the cerebellum and other brain parts, including the cerebral cortex and the spinal cord (Sivaswamy et al., 2010). They consist of the superior cerebellar peduncles (SCPs), the middle cerebellar peduncle (MCP), and the inferior cerebellar peduncles (ICPs), as shown in Figure 1.1. Automatic segmentation and quantification of the cerebellar peduncles is necessary for studying their structure and function objectively and efficiently. Fortunately, diffusion tensor imaging (DTI) (Le Bihan et al., 2001), which can characterize water diffusion magnitude and anisotropy noninvasively, has made this goal achievable. However, while algorithms for automatically segmenting the cerebellar peduncles based on DTI have been proposed, none of the existing methods adequately segments the decussation of the SCPs (dSCP), the region where the SCPs cross.

Figure 1.1 Cerebellar peduncles (SCP, MCP, and ICP) shown with the cerebellum (gray) and the brainstem (purple) (Chuyang Ye, Yang, Ying, & Prince, 2015).


To solve this problem, an automatic method to volumetrically segment the cerebellar peduncles, including the dSCP, was proposed by Ye et al. (Chuyang Ye et al., 2015). This method consists of a random forest classifier (RFC) and a multi-object geometric deformable model (MGDM). The random forest classifier uses features extracted from the DTI scans to provide an initial segmentation of the peduncles. MGDM is then used to refine the random forest classification, leading to smoother and more accurate results. This method was evaluated using leave-one-out cross-validation on five control subjects and four patients with spinocerebellar ataxia type 6 (SCA6). Results on these nine subjects indicate that the method is able to resolve the dSCP and accurately segment the cerebellar peduncles.

The focus of this thesis is on gaining a better understanding of the segmentation performance of this CP segmentation method. This is important because the method will be used on a much larger dataset for scientific analysis. Performance evaluation of automatic medical image segmentation methods is usually conducted by comparing the segmentations with manual delineations (ground truth). However, while this approach characterizes the performance in an average sense, when the method is run on new data (for which no ground truth exists) it is highly desirable to be able to assess algorithm failures so that these cases can be excluded from analysis or rerun with different parameters. Given the large size of the data and the heavy workload of visual inspection, a way to detect algorithm failures automatically and accurately is badly needed.

In this thesis, we propose to carry out quality assurance for this automatic segmentation method using outlier detection (Hodge & Austin, 2004). There is no universally accepted definition of an outlier, but we adopt the definition of Grubbs (Grubbs, 1969), who stated that an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. Outlier detection is a critical task in many safety-critical environments, as an outlier indicates abnormal conditions from which significant performance degradation may result. Since outliers arise for many reasons, such as human error, instrument error, and natural deviation in populations, how an outlier detection method detects and deals with outliers depends on the application area. Although outlier detection techniques have been applied in areas such as fraud detection, activity monitoring, network performance, and detecting novelties in images, there is little research on detecting medical image segmentation failures, and there is no work (of which we are aware) on the specific problem of quality assurance for the automatic cerebellar peduncle segmentation method presented in (Chuyang Ye et al., 2015).

This thesis focuses on better understanding the performance of this CP segmentation method using two outlier detection methods for quality assurance: a simple univariate non-parametric method using box-whisker plots, and a supervised classification method. Before the outlier detection study, we first validated the new segmentation pipeline used in this thesis. Then the univariate outlier detection method using box-whisker plots is described. Automatic segmentation labels of a dataset with 48 subjects were manually categorized as successful segmentations or segmentation failures. Three kinds of features were extracted from the categorized failures and used for failure detection. Next, we applied both box-whisker plots and the supervised classification method to a combined dataset with a total of 249 automatic segmentation results, each manually categorized as a success or a failure. Four classifiers—linear discriminant analysis (LDA), logistic regression (LR), support vector machine (SVM), and random forest classifier (RFC)—were used for failure detection in this combined dataset. Each classifier's performance was evaluated using leave-one-out cross-validation. Results show that the performances of LDA, SVM, and RFC are not very different, while LR performs worse than the other three classifiers.

In this chapter, the main contributions and the thesis organization are described.


1.1 Thesis Contributions

Three main contributions are made in this thesis:

1. Quantitative validation of a new segmentation pipeline: We validated a new integrated cerebellar peduncle segmentation pipeline, CPSeg, against the old separate pipelines, RFC and MGDM, which were used in the original paper reporting the peduncle segmentation algorithm (Chuyang Ye et al., 2015). Dice coefficients (Dice, 1945) and average surface distances (ASDs) between nine segmentation results and the corresponding manual delineations were computed for both pipelines. Statistical tests show that segmentation results from the integrated CPSeg pipeline are not statistically different from those of the old separate pipelines, and that CPSeg performs even better at segmenting the dSCPs.

2. Verification of a preprocessing pipeline: We verified a preprocessing pipeline, CATNAP, against a slightly different version of this pipeline. We call the old version CATNAP-v1 and the new version CATNAP-v2. This verification was necessary because two of our datasets were processed using different CATNAP versions, yet we must merge these data in order to study them together. We conducted both quantitative comparison and visual inspection of the segmentation results from the two CATNAP pipelines using the same inputs. Results show that CATNAP-v2 generates statistically different volumes of the MCPs, and visual inspection of the segmentation results shows that CATNAP-v2 performs better than CATNAP-v1. We therefore chose the CATNAP-v2 pipeline to process the datasets for the further outlier detection study.

3. Outlier detection using box-whisker plots and supervised classification methods: First, we manually categorized the segmentation results on two datasets as either successful segmentations or segmentation failures. We designed features based on the categorized segmentation failures and then detected outliers based on these computed features using box-whisker plots. We also used supervised classification methods for outlier detection. With the manually categorized segmentation results as training data, we applied four classifiers—linear discriminant analysis (LDA), logistic regression (LR), support vector machine (SVM), and random forest classifier (RFC)—for automatic failure detection. We evaluated the performance of each classifier using leave-one-out cross-validation and computed the true positive and false positive rates of each classifier. Our results show that the performances of LDA, the linear SVM, and RFC are not very different, while LR performs worse than the other three classifiers.

1.2 Thesis Organization

This thesis is organized as follows. In Chapter 2, we provide background information. Since our goal is quality assurance for the automatic segmentation method of the cerebellar peduncles, we briefly review automatic segmentation methods for white matter tracts. A brief overview of quality assurance for medical image segmentation algorithms is then given. Since we use outlier detection for quality assurance, a brief literature review of outlier detection methodologies is also provided.


In Chapter 3, we present a quantitative validation of the new integrated segmentation pipeline, CPSeg. We first introduce the old segmentation pipeline, consisting of two separate pipelines (RFC and MGDM), and then CPSeg. Next, we compare the two pipelines and show the differences between their segmentation results. For both pipelines, we report the computed Dice coefficients and average surface distances (ASDs) between the segmentation results and the manual delineations. We used a paired Student's t-test and a Wilcoxon signed-rank test to statistically compare the Dice coefficients and ASDs; the results show that the new segmentation pipeline is not statistically different from the old one.
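The metric and tests used in this validation can be sketched in a few lines. This is an illustrative sketch, not the thesis code: the Dice values below are made-up numbers, and NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats

def dice_coefficient(seg, truth):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    seg, truth = seg.astype(bool), truth.astype(bool)
    intersection = np.logical_and(seg, truth).sum()
    return 2.0 * intersection / (seg.sum() + truth.sum())

# Paired Dice scores for the same nine subjects under the old (RFC+MGDM)
# and new (CPSeg) pipelines -- illustrative values, not the thesis results.
dice_old = np.array([0.82, 0.79, 0.85, 0.80, 0.83, 0.78, 0.81, 0.84, 0.80])
dice_new = np.array([0.83, 0.80, 0.84, 0.81, 0.84, 0.79, 0.82, 0.85, 0.81])

t_stat, p_t = stats.ttest_rel(dice_new, dice_old)  # paired Student's t-test
w_stat, p_w = stats.wilcoxon(dice_new, dice_old)   # Wilcoxon signed-rank test
# Large p-values in both tests would be consistent with the two pipelines
# being not statistically different.
```

The same paired tests apply unchanged to the ASD values; only the input arrays differ.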

Chapter 4 presents results from a simple outlier detection method applied to a dataset of 48 subjects comprising both healthy controls and subjects with ataxia. The segmentation results in this dataset were manually categorized as successful segmentations or segmentation failures. We then computed several statistics and evaluated informally whether these features seemed capable of identifying the poor segmentation results. Lastly, we detected outliers in this dataset using the selected features and evaluated the performance of each feature.
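The univariate rule behind box-whisker (boxplot) outlier detection is Tukey's fences: a value is flagged when it falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch, assuming NumPy and using made-up volumes rather than real segmentation features:

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Flag values outside the box-whisker fences [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical example: MCP volumes (in voxels) for a small cohort, with one
# failed segmentation whose volume is implausibly small.
volumes = [10500, 9800, 10200, 9900, 10800, 10100, 2300, 10400]
mask = tukey_outliers(volumes)
# mask is True only at index 6, the suspiciously small volume.
```

The same rule is applied per feature (volumes, surface areas, diffusion statistics, etc.), one boxplot per feature.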

In the research reported in Chapter 5, we conduct outlier detection on two datasets

using both box-whisker plots and supervised classification methods to study the

performance of the automatic segmentation algorithm. We first studied the segmentation

algorithm’s performance on a data set with 203 subjects including both healthy controls

and patients with ataxia. We manually categorized these segmentation labels as

successful segmentations or segmentation failures and detected outliers using boxplots.

We found that the distributions of some features in the the Kwyjibo dataset are

8

significantly different from those in the Tomacco dataset and features deemed effective in

the Tomacco dataset are not all able to detect outliers in the Kwyjibo dataset. Since the

Kwyjibo dataset was processed using an updated preprocessing pipeline CATNAP-v2

with the CPSeg while Tomacco dataset were processed using CATNAP-v1 with the

CPSeg, we reprocessed the Tomacco dataset using the CATNAP-v2. Before that, we

verified the CATNAP-v2. Last, we combined the reprocessed Tomacco and Kwyjibo

datasets using CATNAP-v2 and the CPSeg. We trained four classifiers, namely linear discriminant analysis (LDA), logistic regression (LR), support vector machine (SVM), and random forest classifier (RFC), for automatic failure detection and evaluated the

performance of each classifier using leave-one-out cross-validation. We also computed

the true positive and false positive rates of each classifier. Results show that the performances of the LDA, SVM, and RFC are not significantly different, while LR performs worse than the other three classifiers.
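The leave-one-out evaluation summarized above can be sketched as follows. This is an illustrative sketch only: a stand-in nearest-centroid classifier is used in place of the actual LDA, LR, SVM, and RFC implementations, which in practice would come from a machine learning library.

```python
import numpy as np

class NearestCentroid:
    """Stand-in classifier with the usual fit/predict interface; the thesis
    uses LDA, LR, SVM, and RFC (e.g., from scikit-learn) in its place."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from each sample to each class centroid; pick the nearest.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

def loo_rates(clf, X, y, positive=1):
    """Leave-one-out CV: hold out each subject once, then compute the
    true positive and false positive rates of the held-out predictions."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.ones(len(y), dtype=bool)
        mask[i] = False
        clf.fit(X[mask], y[mask])
        preds[i] = clf.predict(X[i:i + 1])[0]
    tp = np.sum((preds == positive) & (y == positive))
    fp = np.sum((preds == positive) & (y != positive))
    tpr = tp / max(np.sum(y == positive), 1)
    fpr = fp / max(np.sum(y != positive), 1)
    return preds, tpr, fpr
```

On well-separated synthetic features this yields a true positive rate of 1.0 and a false positive rate of 0.0; on real segmentation features, the rates quantify how reliably failures are flagged.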

In the final chapter, we summarize the main contributions and conclusions of this

thesis. We also highlight future directions for this project.


Chapter 2 Background

The target of this thesis is quality assurance for the automatic cerebellar peduncle segmentation algorithm developed by Ye et al. (Chuyang Ye et al., 2015). The general background and theory of this algorithm are therefore given first. Then a brief overview of

quality assurance methods in medical image analysis is presented. Lastly, we introduce

methodologies for outlier detection.

2.1 Automatic Segmentation Method of Cerebellar Peduncles

The cerebellum has three peduncles: the superior cerebellar peduncles (SCPs), the

middle cerebellar peduncles (MCPs), and the inferior cerebellar peduncles (ICPs). The

SCPs consist mainly of efferent fibers from the cerebellum to the thalamus and red

nucleus. The left and right SCPs cross each other in a region called the decussation of the

SCP (dSCP) in the midbrain. The fibers then head toward the red nuclei on the opposite

side, where some fibers terminate but most continue to the thalamus (Perrini, Tiezzi,

Castagna, & Vannozzi, 2013). The MCPs consist of centripetal fibers, connecting the

cerebellum to the pons. The ICPs primarily contain afferent fibers from the medulla, as

well as efferent fibers to the vestibular nuclei (Mori, Wakana, Van Zijl, & Nagae-

Poetscher, 2005).

Cerebellar peduncles can be affected by neurological diseases including

spinocerebellar ataxia (Murata et al., 1998; Ying et al., 2009), Wilson disease (King et

al., 1996; Magalhaes et al., 1994), schizophrenia (F. Wang et al., 2003), and multiple


system atrophy (Nicoletti et al., 2006). Most studies on the atrophy of cerebellar

peduncles are conducted using manual delineations, which can be time-consuming and

biased. Therefore, automatic segmentation methods of cerebellar peduncles are needed

for further studies on large datasets.

With the development of diffusion tensor imaging (DTI) (Le Bihan et al., 2001),

automatic segmentation methods of white matter tracts were also proposed (Bazin et al.,

2011; Hao, Zygmunt, Whitaker, & Fletcher, 2014; Lawes et al., 2008; Mai, Goebl, &

Plant, 2012; Mayer, Zimmerman-Moreno, Shadmi, Batikoff, & Greenspan, 2011;

Chuyang Ye, Bazin, Bogovic, Ying, & Prince, 2012; C. Ye, Bogovic, Ying, & Prince,

2013; S. Zhang, Correia, & Laidlaw, 2008). These methods approach this problem either

by fiber tracking or by voxel-level classification/clustering based on features extracted

from DTI. However, none of the existing methods adequately segments the dSCP, the

region where the SCPs cross. The segmentation method in Bazin et al. (2011) explicitly models the dSCP and attempts to trace it by feature matching against an atlas registered to the subject. However, because of the small size of the dSCP, this method fails to register the feature atlas closely enough to find the dSCP. Ye et al. (2013) improved this

method by incorporating the linear Westin index (Westin, Peled, Gudbjartsson, Kikinis,

& Jolesz, 1997) as an additional feature, but it is still insufficient to segment the dSCP

accurately.

To address this problem, Ye et al. (2015) proposed a new automatic segmentation

method consisting of a random forest classifier and a multi-object geometric deformable

model. The method models the dSCP, the SCPs, the MCP, and the ICPs as separate

objects based on the observation that the diffusion properties in these regions show


certain homogeneous properties. Features including the primary eigenvectors (PEVs) of

the tensors, the Westin indices describing the shape of the tensors (Westin et al., 1997),

and the spatial position information are used to train a random forest classifier (RFC)

(Breiman, 2001) from manual delineations. A further segmentation step is employed

using a multi-object geometric deformable model (MGDM) (Bogovic, Prince, & Bazin,

2013) to refine and smooth the boundaries. As defined by Breiman (2001), a random forest is a classifier consisting of a collection of tree-structured classifiers; it is a supervised classifier in which decision trees vote for the most popular class, yielding a significant improvement in classification accuracy over a single decision tree.

Three kinds of features—the PEV, the Westin indices, and a registered

template—are used as inputs of the RFC for identifying the cerebellar peduncles. The

PEV is a useful feature for the identification of tracts, but it is unable to distinguish the tract direction in the dSCP, where the SCPs cross. Thus, the PEV is

mapped into a 5D Knutsson space (Knutsson, 1985), creating five Knutsson features that

handle the bidirectional ambiguity of the PEV. The Westin indices, including the linear

index, the planar index, and the spherical index, describe how linear, planar, and

spherical a tensor is shaped. Since the values of the Westin indices differ among noncrossing tracts, crossing tracts, and isotropic areas, they can be used as features to differentiate these regions. Registering a template derived from a manual delineation to the subject to be

segmented can provide an initial estimation of the spatial locations of the cerebellar

peduncles. SyN registration (Avants, Epstein, Grossman, & Gee, 2008) was used to

provide a reliable registration of the template to the target subject. To incorporate the


information from SyN registration into the RFC, signed distance functions (SDFs) were

calculated from the transformed labels. SDFs can indicate how far a voxel of the target

subject is from the registered labels, serving as spatial information about the locations of the cerebellar peduncles.
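Two of the voxel-wise features described above can be illustrated with a short sketch. Note that the exact scaling conventions of the Knutsson mapping and the Westin indices vary in the literature, so this shows one common form rather than the precise implementation used in the pipeline.

```python
import numpy as np

def knutsson_5d(v):
    """Map a primary eigenvector to 5D Knutsson space. The map is
    antipodally symmetric (k(v) == k(-v)), which resolves the sign
    ambiguity of the PEV. One common form; scalings vary in the literature."""
    x, y, z = v / np.linalg.norm(v)
    return np.array([x * x - y * y,
                     2 * x * y,
                     2 * x * z,
                     2 * y * z,
                     (2 * z * z - x * x - y * y) / np.sqrt(3.0)])

def westin_indices(evals):
    """Linear, planar, and spherical Westin indices from tensor eigenvalues
    l1 >= l2 >= l3. With this normalization (by l1), cl + cp + cs = 1."""
    l1, l2, l3 = sorted(evals, reverse=True)
    return (l1 - l2) / l1, (l2 - l3) / l1, l3 / l1
```

With this form, opposite eigenvectors map to the same 5D point and every unit vector maps to a point of constant norm, so the mapped feature varies smoothly across a tract even when the PEV flips sign.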

The RFC provides an initial classification of the cerebellar peduncles, but since 1)

the RFC applies to each voxel independently and 2) the RFC training may have

unbalanced samples (where the more numerous classes tend to be favored in RFC

decisions, producing a bias in the segmented sizes), a further step for refining the initial classification

is required. Therefore, MGDM (Bogovic et al., 2013) was applied to provide both spatial

smoothness and additional fidelity to the data.

2.2 Quality Assurance in the Medical Imaging Field

Extensive, consistent, and regular QA is an essential part of medical imaging. QA

in the magnetic resonance imaging (MRI) field is mainly focused on imaging systems

(Gallichan et al., 2010; Ihalainen, Sipila, & Savolainen, 2004; Z. J. Wang, Seo, Chia, &

Rollins, 2011; Yung, Stefan, Reeve, & Stafford, 2015), DTI image quality (Asman,

Lauzon, & Landman, 2013; Lauzon et al., 2013), and algorithms of medical image

processing and analysis (Rodrigues et al., 2012; Saenz, Kim, Chen, Stathakis, & Kirby,

2015; Sharpe & Brock, 2008). The quality of the imaging system affects the quality of the output images, which in turn, as inputs, affects the final results of medical image processing and analysis algorithms. Quality assurance of these three aspects is therefore interdependent to some extent.

Quality assurance for the medical imaging systems is generally conducted by


comparing parameters of images obtained from these systems using phantoms. Ihalainen et al. (2004) proposed to develop a long-term quality control protocol for the six magnetic

resonance imagers in their organization in order to assure that they fulfill the same basic

image quality requirements. They used the same Eurospin phantom set and compared 11

identical imaging parameters for each imager: image uniformity, ghosting,

SNR and its uniformity, geometric distortion, slice thickness, slice position, slice wrap,

resolution, and T1 and T2 accuracy. Results showed that the six imagers were operating at

a performance level adequate for clinical imaging.

Wang et al. (2011) presented a similar quality assurance procedure for routine

clinical DTI using the widely available American College of Radiology (ACR) head

phantom. They analyzed the data acquired at 1.5 and 3.0 T on whole body clinical MRI

scanners and compared parameters including 1) the signal-to-noise ratio (SNR) at the

center and periphery of the phantom, 2) image distortion by EPI readout relative to spin

echo imaging, 3) distortion of high-b images relative to the b=0 image caused by

diffusion encoding, and 4) determination of fractional anisotropy (FA) and mean

diffusivity (MD) measured with region-of-interest (ROI) and pixel-based approaches.

In Yung et al. (2015), a semi-automated, open-source MRI QA program for multi-

unit institutions was developed. With the reviewable database of phantom measurements,

historical data can be reviewed to compare against previous years' data and to inspect for trends. The QA approach in that work is the same as in the previous ones. Measurements using

phantoms assess geometric accuracy and linearity, position accuracy, image uniformity,

signal, noise, ghosting, transmit gain, center frequency, and magnetic field drift.

Currently, quality inspection of DTI data has relied on visual inspection and


individual processing in DTI analysis software programs (e.g., DTIPrep, DTI-studio). A

DTI experiment can consist of 90 or more volumes, be demanding on hardware, and be susceptible to standard as well as unique artifacts (Gallichan et al., 2010). Quality assurance for DTI data is therefore both important and challenging. To take advantage of statistical methods applied to several metrics for assessing parameters of DTI

data, Lauzon et al. (Lauzon et al., 2013) presented an automatic DTI analysis and quality

assurance pipeline. Parameters computed on DTI data include noise level, artifact

propensity, quality of tensor fit, variance of estimated measures, and bias in estimated

measures. The pipeline completes in 24 hours for one DTI dataset, stores statistical outputs,

and produces a graphical summary QA report. They analyzed 608 DTI datasets using this

pipeline. The efficiency and accuracy of quality analysis using this pipeline were compared with visual inspection.

Work on QA for medical image processing and analysis algorithms is very limited. There is

no uniform QA framework/approach because of the uniqueness and specific aspects of

algorithms and assessment targets in each project. In Rodrigues et al. (2012), a

quantitative QA method for contour compliance referenced against a community set of

contouring experts was proposed. They studied two clinical tumor site scenarios and for

each case, physicians segmented various target and organ-at-risk structures to define a set of

community reference contours. In each set of community contours, a consensus contour

(Simultaneous Truth and Performance Level Estimation (STAPLE)) was created.

Consensus-based contouring penalty metric scores quantified differences between each

individual community contour and the group consensus contour. They reported the outlier

contours identified by the QA system and analyzed possible reasons afterwards.


Saenz et al. (2015) proposed to determine how detailed a physical phantom needs

to be to accurately perform QA for a deformable image registration (DIR) algorithm.

Virtual prostate and head-and-neck phantoms, made from patient images, were used for

this study. Both sets consist of an undeformed and deformed image pair. They found that

a higher number of tissue levels creates more contrast in an image and enables DIR

algorithms to produce more accurate results.

QA approaches in medical imaging systems, DTI data quality, and algorithms are

reviewed above. Overall, QA is important yet not fully studied in the medical imaging field.

More efficient, accurate, and automatic QA approaches remain to be further developed.

This thesis considers a particular approach for a specific algorithm, and therefore

contributes to the general state of knowledge in quality assurance for medical image

analysis.

2.3 Outlier Detection Methodologies

The presence of outliers can be a problem for data analysis in many fields. Outlier

identification is therefore an important part of the data screening process, used to detect and/or remove abnormal observations (Hodge & Austin, 2004). Outliers can result from various causes such as human error, systematic errors, fraudulent behavior, or

simply natural deviations in populations. Considering there is no universally accepted

definition of an outlier, we take the definition of Grubbs (1969), who defined an outlying

observation, or outlier, to be “one that appears to deviate markedly from other members

of the sample in which it occurs”. This review focuses on a general overview of outlier

detection methodologies rather than a specific method for a specific problem. Outlier


detection methods originating from the statistics and machine learning fields are introduced

briefly.

Statistical methods are widely used in outlier detection. For univariate outlier

detection, Grubbs (1969) presented several recommended criteria for determining outliers.

One of these criteria is the Z value, the absolute difference between the sample mean and the query value divided by the sample standard deviation. The Z value is then compared with a critical value at the 1% or 5% significance level for outlier detection. All parameters are calculated directly from the data, so a larger sample represents the population statistically better.
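As a sketch, the Z-value criterion can be implemented in a few lines. The fixed cutoff of 2.0 used here is a placeholder for the critical value that Grubbs' test derives from the significance level and sample size.

```python
import math

def z_outliers(data, threshold=2.0):
    """Flag values whose Z score, |x - mean| / std, exceeds a threshold.
    The threshold here is a fixed stand-in; Grubbs' test would derive a
    critical value from the significance level and the sample size."""
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation (n - 1 denominator).
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return [x for x in data if abs(x - mean) / std > threshold]
```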

Another very simple and fast statistical outlier detection technique, proposed by Laurikkala et al. (2000), is the box-whisker plot. Box plots give the lower and upper extremes, the lower and upper quartiles, and the median of the data, and identify outliers as points lying more than 1.5 times the interquartile range below the lower quartile or above the upper quartile. The whisker factor of 1.5 can be adjusted for different datasets. Box plots require no assumption about the data distribution but need a predefined outlier range.
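A minimal sketch of the box-plot rule, assuming the standard Tukey definition with an adjustable whisker factor:

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Tukey box-plot rule: points more than `whisker` times the
    interquartile range below Q1 or above Q3 are flagged as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lo or v > hi]
```

Applied to per-label Dice coefficients, for example, a value like 0.30 in a sample otherwise clustered near 0.80 would be flagged.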

For multivariate outlier detection, the Mahalanobis distance (De Maesschalck, Jouan-Rimbaud, & Massart, 2000) is the primary choice in many cases. This distance

measure incorporates the dependencies between the variables, which is essential in

multivariate outlier detection. Other distance metrics, such as Euclidean distance using

only location information, are not as accurate as the Mahalanobis distance. However, the Mahalanobis distance can be computationally expensive compared with the Euclidean distance, since it requires the entire dataset to estimate the variable correlations. K-nearest neighbor (KNN) outlier detection calculates the nearest neighbors of a data point using a proper distance metric, such as the Euclidean or Mahalanobis distance. It is a


proximity-based method with no prior assumption about the data distribution. When the

dimension and size of the data increase, this method can be computationally expensive.
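The Mahalanobis distance computation can be sketched as follows; this illustrative version measures each observation's distance from the sample mean using the inverse sample covariance, so correlated variables are accounted for in a way the plain Euclidean distance cannot match.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the sample mean,
    using the inverse sample covariance to account for correlations
    between variables."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diffs = X - mu
    # Quadratic form d_i^2 = diffs_i @ cov_inv @ diffs_i for every row i.
    return np.sqrt(np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs))
```

A point that breaks the correlation structure of the data receives the largest distance even if it is not the farthest in the Euclidean sense.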

The methods described above cannot scale well unless modifications are made to

them. Parametric methods are suitable for large data sets since the model grows only with

model complexity instead of data size. However, the prerequisite for using this kind of method is an assumed data distribution model, which may not reflect the true

distribution of data in some cases. Semi-parametric methods aim to combine the speed

and complexity growth feature of parametric methods with the model flexibility of non-

parametric methods. Roberts et al. (Roberts & Tarassenko, 1994) used a Gaussian

mixture model to learn a model of normal data and detect abnormal observations. Each

mixture component represents a kernel whose width is autonomously determined by the spread of the

data.

In addition to statistical methods, outlier detection can also be achieved using

machine learning. Regression methods using linear models are also widely used, but they

can be too simple for handling some practical cases. Therefore, support vector machines

(SVMs) (Cortes & Vapnik, 1995) have been proposed to address this problem. In SVMs,

the input data are projected to a higher-dimensional space by a kernel function to find a

hyperplane that distinguishes normal data and outliers. The kernel can be a linear dot

product, a polynomial function, or a sigmoid function. SVMs can generate classifiers

from poorly balanced data, which is often the case in medical domains where abnormal

data is rare or difficult to obtain. Tax et al. (Tax, Ypma, & Duin, 1999) applied an SVM

for two-class medical classification. Dreiseitl et al. (Dreiseitl, Osl, Scheibbock, & Binder,

2010) employed one-class SVMs modeling only the normal data for detecting abnormal


subjects in melanoma prognosis. They compared their method with a two-class classification method and concluded that it can be used as an alternative.

Statistical methods primarily focus on real-valued data and require cardinal or

ordinal data to allow vector distances to be calculated. Methods derived from machine

learning can handle categorical data with no ordering. For example, decision trees are

robust and do not require any prior knowledge of the distribution of the data, but they

generate simple class boundaries compared with the complex class boundaries by SVM

or neural networks. To improve the accuracy, ensemble classification methods, for

example, random forests, were proposed (Breiman, 2001). This classification method is

described in Section 2.1, so no additional theory is given here. Generally, a random forest classifier, consisting of a collection of decision trees, performs better than a single decision tree.

Though outlier detection has been applied in many fields such as fraud detection,

activity monitoring, network performance, structural defect detection, time-series

monitoring, and medical condition monitoring, there is very limited study of quality assurance using outlier detection for medical image segmentation algorithms. Considering our case, with only 249 data points and the relatively low dimension of the variables

(number of features < 30), we applied the simple statistical method using box-whisker

plots first. Then with available ground truth for segmentation failures and effective

features for indicating outliers, we moved to classification methods and utilized several

classifiers including SVM and random forest classifier for detecting outliers. Some data

mining algorithms based on tree structured indices and cluster hierarchies (T. Zhang,


Ramakrishnan, & Livny, 1996) are robust, but are specifically optimized for clustering large datasets. Therefore, they are not appropriate for our case.


Chapter 3 Validation of the New

Algorithm Pipeline

There are two segmentation pipelines: the original pipeline consisting of two

separate pipelines used in the paper reporting the automatic segmentation method of

cerebellar peduncles (Chuyang Ye et al., 2015) and the new integrated pipeline, CPSeg,

used in this thesis. A dataset containing 48 subjects was first preprocessed using a

pipeline for registration and diffusion tensor estimation. Then the computed tensors were

processed using the old segmentation pipeline, namely RFC+MGDM, and the new

CPSeg pipeline. Six segmentation labels of the CPs were the final outputs of the two

segmentation pipelines. We then compared the segmentation labels of the two pipelines

and found they were different. To verify that the differences are not statistically significant, we performed a quantitative validation of the CPSeg pipeline. We computed the

Dice coefficients and average surface distances (ASDs) between segmentation labels and

the corresponding manual delineations of 10 subjects in this 48-subject dataset. We

then used a paired Student’s t-test and a Wilcoxon signed-rank test to statistically

compare the Dice coefficients and ASDs; results show that the segmentation labels using

the new segmentation pipeline are not statistically different from those using the old

pipeline. The details are provided in this chapter.


3.1 Algorithm Pipelines

3.1.1 CATNAP

CATNAP (Landman, Farrell, Patel, Mori, & Prince, 2007), short for

Coregistration, Adjustment and Tensor-solving, a Nicely Automated Program, is a data

preprocessing pipeline for Philips DTI (PAR/REC) and MRI data. CATNAP can adjust

diffusion gradient directions for scanner settings, correct motion and eddy-current

artifacts, and compute diffusion tensors and parameters such as fractional anisotropy

(FA), mean diffusivity (MD), and Westin indices. The computed diffusion tensors are

used as inputs to the segmentation pipeline (RFC+MGDM and CPSeg). The CATNAP

pipeline used in Ye et al. (2015) has no distortion correction function. We call it

CATNAP-v1 to differentiate it from a slightly different version, CATNAP-v2 (described

in Chapter 5), which was used for another data set. The two CATNAPs, RFC+MGDM,

and CPSeg pipelines were all implemented using the Java Image Science Toolkit (JIST)

(Lucas et al., 2010). JIST is an algorithm development framework that supports Java-based rapid prototyping, improving the efficiency of evaluating new algorithms.

3.1.2 The segmentation pipelines

The old segmentation pipeline (RFC+MGDM) consists of an RFC process and an MGDM process, as shown in Figure 3.1.



Figure 3.1 The old segmentation pipeline: (a) is the RFC process and (b) is the MGDM

process.


In the training phase of the RFC, the random forest (RF) model, which is the

output of the Train RF WM Initialization module in the RFC pipeline, is trained from a

training set of nine subjects, including five healthy controls and four spinocerebellar

ataxia type 6 (SCA6) patients. Three types of features are used: signed distance functions (“Dist”), the 5D Knutsson vector (“5D”), and the linear Westin index (“Westin”). The

pipelines carrying out feature extraction are not shown here. Manual delineations

(“Manual”) for training were obtained by a trained expert. The trained random forest

model, together with the three kinds of features and a predefined search range of 10 mm,

are inputs to the RF WM Initialization module, which implements the RFC. Initial

segmentations as well as a corresponding membership and a processing brain mask are

outputs of this module. In the MGDM pipeline, the initial segmentations from the RFC

are used as inputs. The MgdmBoundary module, which implements MGDM, is used for

refining and smoothing out the boundaries of the initial segmentations.

The new integrated segmentation pipeline, CPSeg, is shown in Figure 3.2. It

consists of three processes: feature extraction, RFC, and MGDM. Five inputs of the

CPSeg are also marked in Figure 3.2. The diffusion tensor is estimated using the

CATNAP-v1. The template label (true segmentation label) and template linear Westin

index were manually delineated from a healthy control with ID at1029. The RF model is

the same as that in the RFC+MGDM pipeline.


3.2 Comparison of Segmentation labels

Theoretically, the new and old pipelines should produce identical results, since the new one is just a composition of the two old ones. To test this hypothesis, we processed a dataset

with 48 subjects using CATNAP-v1 and CPSeg and compared the segmentation results

with those processed by CATNAP-v1 and RFC+MGDM.

3.2.1 Description of the Tomacco dataset

The Tomacco dataset contains a total of 48 subjects: 18 healthy controls, 6

patients with SCA6, and 24 patients with other neurological diagnoses that affect the

cerebellum. Each subject has a set of several DTI and magnetization-prepared rapid

acquisition gradient echo (MPRAGE) scans. Diffusion weighted images (DWIs) were

acquired using a multi-slice, single-shot EPI sequence on a 3T MR scanner (Intera,

Philips Medical Systems, Netherlands). The sequence has 32 gradient directions and one

b0 image. The b-value is 700 s/mm2. The in-plane resolution is 2.2 mm × 2.2 mm with a 96 × 96 acquisition matrix. The resolution of the output images generated by the scanner is 0.828 mm × 0.828 mm × 2.2 mm. We registered the MPRAGE images to MNI space to obtain an isotropic resolution of 1 mm.

We successfully processed all 48 Tomacco subjects using the CPSeg, with the estimated tensors computed by CATNAP-v1 as inputs. It takes around 2.5 hours to

process one subject using the CATNAP-v1 and 40 minutes using the CPSeg. The total

time for processing one subject is around 3 hours. For each subject, the final outputs of

the CPSeg are six segmentation labels 1–6 representing the left SCP (lSCP), right SCP


(rSCP), dSCP, MCP, left ICP (lICP), and right ICP (rICP), respectively. The Tomacco

dataset had been processed using the RFC+MGDM with tensors computed using the

CATNAP-v1. In the following section, we compare the segmentation labels from the

CPSeg and the RFC+MGDM.

3.2.2 Comparison results

We used the Linux command “diff” to compare the estimated tensors from CATNAP-v1 and the final segmentation labels produced by the two segmentation pipelines.

Results show that the estimated tensors we computed using CATNAP-v1 were exactly the same as those processed by Ye using the same CATNAP-v1. However, the final segmentation labels produced by the CPSeg were different from those produced by the RFC+MGDM.

Since the final segmentation labels of the MGDM process directly depend on the

outputs of the RFC process, we then compared the three outputs–the initial segmentation

labels, the brain masks, and the memberships–of the two RFC processes in the

RFC+MGDM and the CPSeg pipelines. Figure 3.3(a) shows that these three results from

the two RF WM Initialization modules, which implement the two RFC processes in the

two segmentation pipelines, are different.


Then we checked the two RF WM Initialization modules and found their versions

and input parameters were different. To determine whether the different module

versions caused the different outputs, we ran the two modules given the same inputs and

compared their three outputs. Results showed that the segmentation labels of the two

modules were the same, while the brain masks and the memberships were still different.

A portion of the brain masks and memberships generated by the RF WM Initialization module in the RFC+MGDM pipeline was truncated, as shown in Figure 3.3(b). This

indicates that the two RF WM Initialization modules in the two segmentation pipelines

are indeed different in some way.

We checked all the other modules before the RF WM Initialization in a similar

way and found another reason for the differences between the final segmentation labels of the

CPSeg and RFC+MGDM pipelines. We found that, given the same inputs, the CP

Template Registration module in the CPSeg pipeline generated a different registered

template compared with that generated by the SyN registration (Avants et al., 2008) in the

RFC+MGDM pipeline. The CP Template Registration module registers the template

label and the template linear Westin index of a subject with ID at1029 to the target linear

Westin index of a subject to be segmented and creates a registered template. Since this

registered template is used for calculating the SDFs, which provide the spatial location feature of the cerebellar peduncles used to train the RFC in the CPSeg pipeline, it can

influence the final segmentation results. Thus, the differences between the registration

module (CP Template Registration) in the CPSeg pipeline and the corresponding SyN

registration process in the RFC+MGDM pipeline are a reason for the different final segmentation labels in the two segmentation pipelines.


In conclusion, the new segmentation pipeline CPSeg cannot generate the same

segmentation labels as the old pipeline RFC+MGDM. Two possible reasons for this were analyzed. We therefore need to validate the segmentation results of the two pipelines to verify that the differences are not statistically significant. Although the results are different, it is

possible that the differences are not statistically significant relative to the final quantities

that we will compare in population studies. For this reason we next analyzed the statistics

of the two final segmentation results by using the two pipelines, and these results are

described in the following section.

3.3 Statistical Tests

3.3.1 Tests on the Dice coefficients

We first calculated the Dice coefficients (Dice, 1945) between manual

delineations and the initial segmentation labels in the two RFC processes in the

RFC+MGDM and CPSeg pipelines, respectively. We then calculated the Dice

coefficients between manual delineations and the final segmentations from the two

MGDM processes in the RFC+MGDM and CPSeg pipelines, respectively. The two

groups of Dice coefficients are shown in Tables 3.1 and 3.2, respectively. For

convenience, the RFC segmentation results and the final segmentation results after

MGDM refinement are called “RFC” and “RFC + MGDM” in the RFC+MGDM pipeline

and called “RFC_of_CPSeg” and “CPSeg” in the CPSeg pipeline.
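For reference, the Dice coefficient between two binary label masks can be computed as in the following sketch. The handling of two empty masks is a convention chosen here for illustration, not taken from the cited work.

```python
import numpy as np

def dice_coefficient(seg, ref):
    """Dice overlap 2|A intersect B| / (|A| + |B|) between two binary
    label masks, as used to compare automatic segmentations with
    manual delineations."""
    seg, ref = seg.astype(bool), ref.astype(bool)
    denom = seg.sum() + ref.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * np.logical_and(seg, ref).sum() / denom
```

A Dice coefficient of 1 indicates perfect overlap with the manual delineation, and 0 indicates no overlap at all.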


Table 3.1 The Dice coefficients between the manual delineations and the segmentation

labels in the two RFC processes in the RFC+MGDM and CPSeg pipelines, respectively.

              RFC                                          RFC_of_CPSeg
       lSCP   rSCP   dSCP   MCP    lICP   rICP     lSCP   rSCP   dSCP   MCP    lICP   rICP
S1     0.828  0.793  0.702  0.826  0.753  0.762    0.828  0.798  0.715  0.827  0.753  0.763
S2     0.776  0.766  0.753  0.860  0.670  0.660    0.768  0.768  0.765  0.853  0.674  0.665
S3     0.774  0.722  0.834  0.831  0.712  0.710    0.773  0.722  0.834  0.832  0.709  0.710
S4     0.739  0.797  0.719  0.874  0.655  0.616    0.741  0.797  0.732  0.873  0.653  0.611
S5     0.820  0.797  0.816  0.856  0.777  0.728    0.812  0.805  0.816  0.859  0.780  0.731
S6     0.813  0.778  0.286  0.838  0.678  0.689    0.801  0.767  0.310  0.835  0.675  0.691
S7     0.833  0.820  0.755  0.865  0.720  0.680    0.823  0.815  0.725  0.867  0.728  0.680
S8     0.807  0.785  0.826  0.829  0.640  0.668    0.812  0.789  0.816  0.826  0.636  0.686
S9     0.785  0.763  0.787  0.851  0.739  0.674    0.786  0.759  0.818  0.853  0.729  0.659
Mean   0.797  0.780  0.720  0.848  0.705  0.688    0.794  0.780  0.726  0.848  0.704  0.689
Std.   0.031  0.028  0.169  0.017  0.047  0.042    0.029  0.029  0.162  0.018  0.048  0.044

Table 3.2 The Dice coefficients between the manual delineations and the final

segmentation labels in the two MGDM processes in the RFC+MGDM and CPSeg

pipelines, respectively.

              RFC+MGDM                                     CPSeg
       lSCP   rSCP   dSCP   MCP    lICP   rICP     lSCP   rSCP   dSCP   MCP    lICP   rICP
S1     0.839  0.762  0.689  0.817  0.782  0.787    0.839  0.763  0.696  0.817  0.782  0.786
S2     0.824  0.816  0.815  0.850  0.675  0.676    0.825  0.819  0.814  0.851  0.679  0.685
S3     0.803  0.783  0.870  0.828  0.759  0.752    0.806  0.779  0.871  0.828  0.756  0.749
S4     0.813  0.809  0.758  0.870  0.695  0.648    0.812  0.805  0.758  0.870  0.699  0.648
S5     0.798  0.775  0.862  0.864  0.777  0.778    0.794  0.764  0.862  0.864  0.781  0.776
S6     0.786  0.770  0.711  0.843  0.704  0.729    0.778  0.763  0.745  0.840  0.702  0.729
S7     0.769  0.795  0.775  0.872  0.731  0.702    0.779  0.794  0.800  0.873  0.749  0.702
S8     0.785  0.767  0.785  0.851  0.681  0.705    0.784  0.762  0.803  0.846  0.669  0.702
S9     0.799  0.795  0.833  0.855  0.765  0.707    0.791  0.795  0.862  0.857  0.757  0.707
Mean   0.802  0.786  0.789  0.850  0.730  0.720    0.801  0.783  0.801  0.850  0.730  0.720
Std.   0.021  0.019  0.063  0.018  0.042  0.046    0.021  0.021  0.060  0.019  0.043  0.044


To assess the statistical significance of the segmentation differences between the CPSeg and RFC+MGDM pipelines, a paired Student’s t-test and a Wilcoxon signed-rank test were conducted with respect to the Dice coefficients. The p-values of the two tests are shown in Table 3.3.
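The paired t statistic underlying the first test can be sketched as follows; computing the p-value additionally requires the CDF of the t distribution, which in practice is supplied by a statistics package such as SciPy.

```python
import math

def paired_t_statistic(x, y):
    """Paired Student's t statistic on per-subject differences d = x - y:
    t = mean(d) / (std(d) / sqrt(n)), with n - 1 degrees of freedom.
    The p-value would then come from the t distribution with that many
    degrees of freedom (e.g., via scipy.stats)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1
```

Because the two pipelines are run on the same subjects, the paired form is appropriate: it tests the per-subject differences rather than the two groups of Dice coefficients independently.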

In Table 3.3 we can see that the p-values for the final segmentations of the dSCP on both tests are smaller than 0.05, the significance level we chose. This indicates that the

segmentation results of the dSCP between the CPSeg and RFC+MGDM pipelines are

significantly different. In Table 3.2 we can see that the average Dice coefficient of the

final segmentations of the dSCP using the CPSeg pipeline is 0.801, which is larger than

0.789, the average Dice coefficient of the final segmentations of the dSCP using the

RFC+MGDM pipeline. This indicates that the CPSeg performs better than RFC+MGDM

on segmenting the dSCP. As for the rest cerebellar peduncles, the performance between

the CPSeg and the RFC+MGDM are not statistically different.
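For illustration, the two tests can be reproduced in a few lines with SciPy, using the per-subject dSCP Dice coefficients transcribed from Table 3.2 (a sketch; the thesis does not state which software produced Table 3.3):

```python
import numpy as np
from scipy import stats

# Dice coefficients of the dSCP for subjects S1-S9, taken from Table 3.2
# (final segmentations of the RFC+MGDM and CPSeg pipelines).
dice_rfc_mgdm = np.array([0.689, 0.815, 0.870, 0.758, 0.862, 0.711, 0.775, 0.785, 0.833])
dice_cpseg = np.array([0.696, 0.814, 0.871, 0.758, 0.862, 0.745, 0.800, 0.803, 0.862])

# Paired Student's t-test (assumes approximately normal paired differences).
t_stat, p_t = stats.ttest_rel(dice_cpseg, dice_rfc_mgdm)

# Wilcoxon signed-rank test (non-parametric paired alternative).
w_stat, p_w = stats.wilcoxon(dice_cpseg, dice_rfc_mgdm)

print(f"paired t-test p = {p_t:.3f}, Wilcoxon p = {p_w:.3f}")
```

With the rounded table values, both p-values come out below the 0.05 significance level, consistent with the starred entries in Table 3.3.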

Table 3.3 The p-values of the paired Student's t-test and the Wilcoxon signed-rank test for comparing the Dice coefficients of the RFC and RFC+MGDM results with the results from CPSeg.

Paired Student's t-test
            lSCP    rSCP    dSCP    MCP     lICP    rICP
RFC         0.141   0.958   0.363   0.913   0.765   0.767
RFC+MGDM    0.699   0.075   0.025*  0.721   0.859   0.936

Wilcoxon signed-rank test
            lSCP    rSCP    dSCP    MCP     lICP    rICP
RFC         0.301   1       0.297   0.734   0.820   0.570
RFC+MGDM    0.570   0.078   0.031*  0.820   0.910   0.496

Note: * p < 0.05


3.3.2 Tests on the average surface distances (ASDs)

As another comparison of the two pipelines, we calculated the ASDs between the manual delineations and the initial segmentation labels from the two RFC processes in the RFC+MGDM and CPSeg pipelines, and then the ASDs between the manual delineations and the final segmentations from the two pipelines. The two groups of ASDs are given in Tables 3.4 and 3.5, respectively.
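An ASD between two binary masks can be sketched as follows. This uses one common definition, the symmetric mean distance between boundary voxels, since the thesis does not spell out its exact formula; the helper names are illustrative:

```python
import numpy as np
from scipy import ndimage

def surface_voxels(mask):
    """Boundary voxels of a boolean mask: those removed by one erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def average_surface_distance(a, b, spacing=1.0):
    """Symmetric mean distance between the surfaces of two binary masks."""
    sa, sb = surface_voxels(a), surface_voxels(b)
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_b = ndimage.distance_transform_edt(~sb, sampling=spacing)
    dist_to_a = ndimage.distance_transform_edt(~sa, sampling=spacing)
    # Pool surface-to-surface distances in both directions and average.
    return np.concatenate([dist_to_b[sa], dist_to_a[sb]]).mean()
```

For identical masks the ASD is zero; any misalignment between the automatic and manual surfaces increases it, with `spacing` carrying the voxel size in millimetres.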

Table 3.4 The ASDs between the manual delineations and the segmentation labels in the

two RFC processes in the RFC+MGDM and CPSeg pipelines, respectively.

        RFC                                            RFC_of_CPSeg
        lSCP   rSCP   dSCP   MCP    lICP   rICP       lSCP   rSCP   dSCP   MCP    lICP   rICP
S1      0.391  0.562  0.484  0.618  0.647  0.561      0.391  0.549  0.462  0.616  0.649  0.563
S2      0.621  0.627  0.385  0.542  0.895  0.816      0.662  0.615  0.385  0.571  0.875  0.793
S3      0.585  0.748  0.314  0.621  0.700  0.676      0.590  0.746  0.329  0.615  0.707  0.678
S4      0.969  0.452  0.423  0.504  0.826  0.893      0.960  0.451  0.411  0.508  0.834  0.901
S5      0.433  0.443  0.239  0.588  0.542  0.607      0.447  0.420  0.239  0.570  0.538  0.602
S6      0.416  0.486  0.945  0.801  0.776  0.723      0.430  0.516  0.909  0.813  0.781  0.709
S7      0.354  0.377  0.275  0.734  0.645  0.813      0.380  0.389  0.310  0.726  0.624  0.817
S8      0.430  0.456  0.234  0.846  0.876  0.785      0.420  0.447  0.255  0.871  0.849  0.770
S9      0.535  0.516  0.376  0.720  0.635  0.801      0.533  0.527  0.320  0.688  0.658  0.820
Mean    0.526  0.518  0.408  0.664  0.727  0.742      0.535  0.518  0.402  0.664  0.724  0.739
Std.    0.189  0.113  0.218  0.117  0.122  0.109      0.186  0.111  0.203  0.120  0.117  0.110

Table 3.5 The ASDs between the manual delineations and the final segmentation labels

in the two MGDM processes in the RFC+MGDM and CPSeg pipelines, respectively.

        RFC+MGDM                                       CPSeg
        lSCP   rSCP   dSCP   MCP    lICP   rICP       lSCP   rSCP   dSCP   MCP    lICP   rICP
S1      0.408  0.675  0.552  0.700  0.582  0.523      0.407  0.677  0.549  0.698  0.588  0.527
S2      0.501  0.510  0.298  0.599  0.902  0.796      0.502  0.492  0.323  0.615  0.908  0.767
S3      0.521  0.590  0.268  0.663  0.611  0.614      0.513  0.609  0.261  0.659  0.625  0.626
S4      0.774  0.449  0.358  0.551  0.797  0.873      0.773  0.456  0.358  0.557  0.780  0.848
S5      0.512  0.539  0.212  0.593  0.589  0.516      0.520  0.574  0.212  0.585  0.573  0.528
S6      0.492  0.552  0.395  0.789  0.735  0.635      0.517  0.561  0.362  0.803  0.750  0.637
S7      0.533  0.503  0.313  0.678  0.641  0.755      0.523  0.502  0.273  0.673  0.612  0.749
S8      0.518  0.543  0.291  0.759  0.819  0.710      0.515  0.570  0.268  0.793  0.835  0.752
S9      0.538  0.471  0.328  0.684  0.596  0.725      0.548  0.468  0.263  0.674  0.612  0.725
Mean    0.533  0.537  0.335  0.668  0.697  0.683      0.535  0.546  0.319  0.673  0.698  0.684
Std.    0.098  0.067  0.097  0.078  0.119  0.121      0.097  0.072  0.099  0.084  0.123  0.111


To show the statistical significance of the segmentation differences between the new and old pipelines, a paired Student's t-test and a Wilcoxon signed-rank test were performed with respect to the ASDs. The p-values of the two tests are shown in Table 3.6. Results show that the segmentation performances of the two pipelines are statistically the same.

Table 3.6 The p-values of the paired Student's t-test and the Wilcoxon signed-rank test for comparing the ASDs of the RFC and RFC+MGDM results with the results from CPSeg.

Paired Student's t-test
            lSCP    rSCP    dSCP    MCP     lICP    rICP
RFC         0.147   0.911   0.543   0.935   0.581   0.592
RFC+MGDM    0.546   0.145   0.107   0.377   0.827   0.855

Wilcoxon signed-rank test
            lSCP    rSCP    dSCP    MCP     lICP    rICP
RFC         0.203   0.734   0.641   1       0.910   0.734
RFC+MGDM    1       0.164   0.109   0.570   1       0.734

Note: * p < 0.05

3.3.3 Conclusion

The paired Student's t-test and the Wilcoxon signed-rank test on both the Dice coefficients and the ASDs show that CPSeg and RFC+MGDM are not statistically different, except that CPSeg performs better than RFC+MGDM at segmenting the dSCP. We can therefore use the CPSeg pipeline to process other datasets and use its results as the basis of scientific conclusions.


Chapter 4 Outlier Detection on the Tomacco Dataset

This chapter studies the performance of the automatic segmentation methods described in Chapter 3 using the box-whisker plot, which implements a simple univariate outlier detection method. Box plots display the lower extreme, lower quartile (25%), median, upper quartile (75%), and upper extreme of the data. Between the lower and upper quartiles lies the interquartile range (IQR), which contains the middle 50% of the data. A data point is conventionally defined as an outlier when it lies 1.5×IQR or more above the upper quartile, or 1.5×IQR or more below the lower quartile; the 1.5 multiplier is a convention and can be adjusted for a given dataset.
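This boxplot rule (Tukey's 1.5×IQR fences) can be sketched in a few lines; the `values` array below is illustrative data, not taken from the thesis:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points beyond k*IQR from the quartiles (Tukey's boxplot rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

# Illustrative Dice-like values with one clearly deviant entry.
values = np.array([0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.31])
print(iqr_outliers(values))  # only the 0.31 entry is flagged
```

Increasing `k` makes the rule more conservative, which is one way the "range" mentioned above can be tuned per dataset.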

The Tomacco dataset was processed using CATNAP-v1 to calculate diffusion tensors and then using CPSeg for automatic segmentation of the CPs. Additional information about this dataset is given in Chapter 3. We manually categorized each automatic segmentation in the Tomacco dataset as a successful segmentation or a segmentation failure. We then designed three kinds of features of the image data, informed by the categorized failures. Outliers were detected using the box-whisker plot and were treated as possible detections of algorithm failures. We then evaluated each feature's performance based on its true positive and false positive rates.

4.1 Categorization of Automatic Segmentations

Successful and failed segmentations in our study were manually categorized by


We categorized all the segmentation labels of the 48 subjects in the Tomacco dataset and assigned each subject a numerical score assessing segmentation quality. The Tomacco dataset's information, together with the categories and scores, is listed in Table 4.1. A "1" in this table denotes a segmentation failure and a "0" a successful segmentation. The numerical scores are integers ranging from 0 to 10: failed segmentations are assigned scores from 0 to 6 and successful ones from 7 to 10, inclusive. Among the successful segmentations, there are some imperfect ones with scores of 7, 8, or 9. For example, segmentations with a small portion of the CPs missing are considered imperfect but successful. As shown in Table 4.1, nine of the Tomacco subjects are categorized as segmentation failures and the rest as successful segmentations. In the following section, we look into these failures to see whether there are image features that can indicate potential segmentation failures.


Table 4.1 Information of the 48 subjects in the Tomacco dataset including ID, diagnoses,

gender, categories, and scores.

Number  ID      Diagnosis      Gender  Category  Score    Number  ID      Diagnosis      Gender  Category  Score
1       at1002  cb             M       1         1        25      at1018  control        F       0         10
2       at1103  control        M       1         1        26      at1020  control        F       0         10
3       at1016  cb             M       1         4        27      at1021  control        M       0         10
4       at1083  control        F       1         4        28      at1022  cb             F       0         10
5       at1034  SCA6           F       1         5        29      at1023  cb             M       0         10
6       at1046  cb             M       1         5        30      at1024  nph            M       0         10
7       at1049  SCA6           F       1         5        31      at1025  cb             F       0         10
8       at1007  cb             F       1         5        32      at1026  control        M       0         10
9       at1078  cb+            F       1         6        33      at1027  ?cb            M       0         10
10      at1081  control        M       0         8        34      at1028  cb+            M       0         10
11      at1032  control        F       0         9        35      at1029  control        F       0         10
12      at1041  cb+            M       0         9        36      at1031  control        F       0         10
13      at1056  ?cb            F       0         9        37      at1033  SCA6           F       0         10
14      at1080  control        F       0         9        38      at1036  cb+            F       0         10
15      at1086  control        M       0         9        39      at1038  vest           M       0         10
16      at1000  sca6           M       0         10       40      at1040  fam17/control  M       0         10
17      at1003  cb             M       0         10       41      at1043  cb             M       0         10
18      at1005  fam17/cb       F       0         10       42      at1044  control        F       0         10
19      at1006  cb             M       0         10       43      at1045  control        M       0         10
20      at1011  cb             F       0         10       44      at1048  SCA6           F       0         10
21      at1013  ?cb            M       0         10       45      at1060  cb             M       0         10
22      at1014  cb             M       0         10       46      at1079  control        F       0         10
23      at1015  cb             M       0         10       47      at1082  control        F       0         10
24      at1017  SCA6           M       0         10       48      at1084  control        F       0         10

Note: Category 1: segmentation failure; Category 0: successful segmentation


4.2 Feature Extraction

To design features for finding potential segmentation failures, we first need to look into the nine categorized failures in the Tomacco dataset. Generally, there are two kinds of segmentation failures. The first kind has incomplete labels of the six CPs; for example, some segmentation results have only three or four labels out of six. Figures 4.2(b), (c), and (d) show three failures of this kind. The labels for the lSCP, rSCP, and dSCP are missing in the segmentation results of the subjects with IDs at1034 and at1049, shown in Figures 4.2(b) and (c). The labels for the dSCP and rICP are missing in the segmentation result of the subject with ID at1083.

We define this kind of segmentation failure based on the assumption that every person should have a complete set of six labels. Subjects with ataxia may have relatively smaller CPs, but the six structures should still be present. When the quality of the DTI scans is poor, the smallest CP (the dSCP) may not be apparent in the linear Westin index used for automatic segmentation, and the segmentation algorithm may then fail to find it. This problem can result from poor data quality rather than a flaw in the algorithm pipeline itself. Since we are doing quality assurance on the whole pipeline, including the raw datasets (DTI scans and MPRAGE structural images), we consider this a segmentation failure rather than a successful segmentation. Regardless of the reason for the missing labels, such segmentation results cannot be used for scientific analysis, so treating these segmentations as failures is reasonable.

The second kind of segmentation failure is one with abnormal shape, size, or relative positions of the six CPs. In Figure 4.2, an image of a successful segmentation (for reference) and images of the nine segmentation failures of the Tomacco dataset are shown in a similar anatomical position. In Figures 4.2(d) and (e), large portions of the MCPs are cut off. Figures 4.2(f)–(j) show failures with abnormal shapes of the CPs.

In addition to the nine failures, there are also six imperfect segmentations in the Tomacco dataset. Usually imperfect segmentations are those with a small missing portion of a CP. For example, Figure 4.3 shows images where small portions of the MCP and ICP have been cut out. By checking intermediate results throughout the whole pipeline (CATNAP-v1 and CPSeg), we found that the most likely source of the imperfect cases lies in the brain masks, which were generated by a skull-stripping module in CATNAP-v1. Incomplete or asymmetric brain masks can cut off a portion of the CPs in the Westin indices, which causes incomplete final segmentation images. Generally this is not a big issue since, based on visual inspection, we found that the missing portions of some CPs are so small that they can be ignored. However, if a brain mask is sufficiently abnormal, it can cut off a large portion of some CPs in a segmentation result and make it a failure, which cannot be used for further scientific analysis. For example, as shown in Figure 4.2(e), in the segmentation failure of the subject with ID at1007, a large portion of the left brain mask is missing, and this mask cuts off a large portion of the left MCP.


Based on the analysis of the failed and imperfect segmentations above, three kinds of features are designed. The first kind of feature is object oriented and characterizes failures at the peduncle level. The volumes and surface areas of the CPs are the two features in this category; they are computed directly from the segmentation labels and notated as V = [v_lSCP, v_rSCP, v_dSCP, v_MCP, v_lICP, v_rICP] and S = [s_lSCP, s_rSCP, s_dSCP, s_MCP, s_lICP, s_rICP], respectively.
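A minimal sketch of how such features could be computed from a label image follows. The voxel-face surface estimate is an assumption for illustration; the thesis does not state which volume or surface estimator was used:

```python
import numpy as np

def label_volumes(label_img, voxel_volume=1.0):
    """Volume per label: voxel count times the volume of one voxel."""
    labels, counts = np.unique(label_img[label_img > 0], return_counts=True)
    return {int(l): c * voxel_volume for l, c in zip(labels, counts)}

def surface_area_voxel(label_img, label, face_area=1.0):
    """Crude surface area: count exposed voxel faces of one label."""
    m = (label_img == label)
    faces = 0
    for axis in range(m.ndim):
        # Faces between a label voxel and a background voxel.
        faces += np.count_nonzero(np.diff(m.astype(np.int8), axis=axis))
        # Faces of label voxels lying on the array border.
        m0 = np.moveaxis(m, axis, 0)
        faces += np.count_nonzero(m0[0]) + np.count_nonzero(m0[-1])
    return faces * face_area
```

For anisotropic voxels, `voxel_volume` and `face_area` would be derived from the image spacing; smoother surface estimates (e.g., from a mesh) would give slightly different numbers but the same outlier behaviour.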

The second kind of feature is data quality oriented. Since the RFC is trained using the linear Westin index (computed from the diffusion tensor), the tensor's quality is directly related to the segmentation quality. For example, the subjects in Figures 4.2(b), (c), (i), and (j) have dim or abnormal linear Westin indices, which makes the structure of the CPs very unclear; the segmentation quality for these four subjects is correspondingly poor. We can therefore conclude that diffusion tensor quality and segmentation failures are correlated. Based on this observation, we chose some tensor-related parameters as features. In particular, we chose the means and standard deviations (SDs) of the fractional anisotropy (FA), the mean diffusivity (MD), and the three Westin indices over the whole brain, notated as FA = [mean_FA, std_FA], MD = [mean_MD, std_MD], and WI = [mean_Cl, std_Cl, mean_Cp, std_Cp, mean_Cs, std_Cs], respectively. Cl, Cp, and Cs are the linear, planar, and spherical Westin indices, respectively. Because the linear Westin index of each peduncle is used to train the RFC, to avoid its circular use we calculated all these parameters over the whole brain instead of just over the peduncles.
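For reference, FA, MD, and the Westin indices can all be derived from the sorted tensor eigenvalues. This sketch uses the sum-normalized form of the Westin indices (one of the two normalizations in the literature, under which the three indices sum to one), which may differ from the thesis's exact convention:

```python
import numpy as np

def tensor_scalars(evals):
    """FA, MD, and Westin indices from eigenvalues sorted l1 >= l2 >= l3.

    evals has shape (..., 3). The Westin indices use the sum
    normalization, for which cl + cp + cs = 1.
    """
    l1, l2, l3 = evals[..., 0], evals[..., 1], evals[..., 2]
    mean = evals.mean(axis=-1, keepdims=True)
    md = mean[..., 0]
    # Fractional anisotropy of the eigenvalue triple.
    num = np.sqrt(((evals - mean) ** 2).sum(axis=-1))
    den = np.sqrt((evals ** 2).sum(axis=-1))
    fa = np.sqrt(1.5) * num / den
    s = l1 + l2 + l3
    cl = (l1 - l2) / s          # linear
    cp = 2.0 * (l2 - l3) / s    # planar
    cs = 3.0 * l3 / s           # spherical
    return fa, md, cl, cp, cs
```

An isotropic tensor gives FA = 0 and Cs = 1, while a purely linear one (l2 = l3 = 0) gives FA = 1 and Cl = 1; the whole-brain means and SDs of these maps form the FA, MD, and WI feature vectors above.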

The third kind of feature is related to the brain masks. Analysis of segmentation failures shows that abnormal brain masks can make the linear Westin index incomplete, which may cut off some structures of the CPs. To find failures resulting from this cause, the volume and symmetry of each subject's brain mask are used as two features. We also used the volumes of the two half brain masks as features, considering that the incompleteness of a brain mask may occur on only one side, either right or left. The feature vector is BM = [v_BM, v_leftBM, v_rightBM, sym], where v_BM, v_leftBM, and v_rightBM are the volumes of a whole brain mask and its two halves, and sym = v_leftBM / v_rightBM − 1 represents the symmetry of a brain mask. Generally, the brain mask features are indirect and may not perform as well as the other features, since failures resulting from this cause are relatively rare in the Tomacco dataset.
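A sketch of these brain-mask features, assuming for illustration that the first array axis is the left-right axis with the midline at the array centre (the real pipeline would split along an anatomically defined midsagittal plane):

```python
import numpy as np

def brain_mask_features(mask, voxel_volume=1.0):
    """Whole/left/right brain-mask volumes and a symmetry score.

    Illustrative assumption: axis 0 is left-right and the midline is
    at the centre of the array. sym is 0 for a perfectly symmetric
    mask and deviates from 0 when one half is incomplete.
    """
    mid = mask.shape[0] // 2
    v_left = mask[:mid].sum() * voxel_volume
    v_right = mask[mid:].sum() * voxel_volume
    v_total = v_left + v_right
    sym = v_left / v_right - 1.0
    return np.array([v_total, v_left, v_right, sym])
```

A mask with a large chunk of one hemisphere missing, as in the at1007 failure described above, would show up as a strongly nonzero `sym` and a reduced half-volume.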

In summary, three kinds of features are used to try to detect segmentation failures. The feature vector of a segmentation result is F = [V, S, FA, MD, WI, BM], with a total of 26 numerical features.

4.3 Outlier Detection Results

We take the definition of an outlier from Grubbs (1969): "An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs." Outliers in our numerical data were detected using a box-whisker plot. The bottom and top of the box are the first and third quartiles of the data; between them lies the interquartile range (IQR), containing 50% of the data. If a data point is 1.5×IQR or more above the third quartile, or 1.5×IQR or more below the first quartile, it is detected as an outlier. The notch in a boxplot displays a confidence interval around the median; if two boxes' notches do not overlap, there is strong evidence (95% confidence) that their medians differ.
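The notch width typically follows the McGill, Tukey, and Larsen convention, median ± 1.57×IQR/√n; a sketch of the resulting overlap check (an informal approximation, not a formal hypothesis test):

```python
import numpy as np

def notch_interval(x):
    """~95% confidence interval for the median, as drawn by boxplot
    notches (median +/- 1.57 * IQR / sqrt(n))."""
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    half = 1.57 * (q3 - q1) / np.sqrt(len(x))
    return med - half, med + half

def medians_differ(x, y):
    """Strong evidence (~95%) that medians differ if notches don't overlap."""
    lo_x, hi_x = notch_interval(x)
    lo_y, hi_y = notch_interval(y)
    return hi_x < lo_y or hi_y < lo_x
```

This is the criterion used informally below when comparing the paired manual and automatic boxplots, and later when comparing feature distributions between the Tomacco and Kwyjibo datasets.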

Boxplots with outliers of the computed features from the 48 Tomacco datasets are shown in Figures 4.4, 4.5, 4.6, 4.7, and 4.8. The volumes and surface areas of the six CPs of the manual delineations of 10 Tomacco datasets are compared with the corresponding 10 automatic segmentations, connected by dashed lines, as shown in Figures 4.4 and 4.5. Ideally, the dashed lines should be parallel. However, since the validated segmentation pipeline (CPSeg) used for processing the Tomacco dataset is different from the original separate pipelines (RFC+MGDM), differences in volumes and surface areas are to be expected. The positions of the notches in the paired results show that their medians are statistically the same.

We also found that the outliers in these boxplots cover several kinds of diagnoses rather than a specific one. This indicates that the segmentation algorithm can perform well on different diseases and is not biased toward any particular one.

We evaluated the performance of our selected features at finding segmentation failures by comparing the detected outliers with the categorized segmentation failures (ground truth). For each feature, we computed the true positive and false positive rates, as shown in Table 4.2. The features are listed in descending order of true positive rate. Note that "volume" and "surface area" in Table 4.2 refer to the volumes and surface areas of the six CPs; if any one CP is detected as an outlier, the whole segmentation is flagged as an outlier, which is represented by 1.
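These rates can be computed directly from the binary outlier flags and the manual categories; `flags` and `is_failure` below are illustrative stand-ins, not the actual Tomacco results:

```python
import numpy as np

def tp_fp_rates(flags, is_failure):
    """True/false positive rates of outlier flags against categorized failures."""
    flags = np.asarray(flags, dtype=bool)
    is_failure = np.asarray(is_failure, dtype=bool)
    tp_rate = flags[is_failure].mean()    # flagged among true failures
    fp_rate = flags[~is_failure].mean()   # flagged among successes
    return tp_rate, fp_rate

# Toy example: 5 subjects, 3 true failures, one missed and one false alarm.
tp, fp = tp_fp_rates([1, 1, 0, 0, 1], [1, 1, 1, 0, 0])
```

Each row of Table 4.2 corresponds to one call of this kind, with nine true failures among the 48 Tomacco subjects.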

The true positive and false positive rates of each feature in Table 4.2 show that the object oriented features (the volume and surface area of each peduncle) and some of the data quality related features (the mean FA and the mean and standard deviation of the linear Westin index) generally perform better than the brain mask features. This is consistent with our analysis of the brain mask features in Section 4.2. Among the data quality features, the mean MD does not perform well, for reasons we do not know. At this time, however, it is too early to conclude that this feature is useless, since Tomacco is a relatively small dataset with limited segmentation failures. MD is therefore still used for outlier detection in the following chapters.

Generally speaking, this univariate non-parametric outlier detection method using boxplots works on the Tomacco dataset. However, to assess the performance of the segmentation algorithm on a larger dataset and to detect outliers more efficiently, methods for multivariate outlier detection are needed. Before using more complex methods, we will study a larger dataset in the next chapter using box-whisker plots to double check the performance of these features.


Figure 4.4 Volumes of the six CPs of the manual delineations of 10 Tomacco datasets (red boxes) and the automatic segmentations of the 48 Tomacco datasets (blue boxes), respectively.

Figure 4.5 Surface areas of the six CPs of the manual delineations of 10 Tomacco datasets (red boxes) and the automatic segmentations of the 48 Tomacco datasets (blue boxes), respectively.


Figure 4.6 Means and SDs of FA and MD of the whole brains of the 48 Tomacco datasets.

Figure 4.7 Means and SDs of the three Westin indices of the whole brains of the 48 Tomacco datasets.


Figure 4.8 Brain mask features: volumes of the left, right, and whole brain masks (the left three boxplots) and the symmetry of the brain masks (the rightmost boxplot) of the 48 Tomacco datasets.


Table 4.2 Outlier detection results by selected features of the Tomacco dataset. The top

nine subjects were manually categorized as segmentation failures in this dataset.

              at1002  at1007  at1016  at1034  at1046  at1049  at1078  at1083  at1103  TP rate  FP rate
surface area  1       0       0       1       1       1       1       1       1       77.8%    5.1%
mean FA       1       0       1       1       1       1       0       1       1       77.8%    0.0%
mean Cs       1       0       1       1       1       1       0       1       1       77.8%    0.0%
volume        1       0       0       1       0       1       1       1       1       66.7%    5.1%
mean Cl       1       0       0       1       1       1       0       1       1       66.7%    0.0%
std of Cl     1       0       1       1       1       1       0       0       1       66.7%    0.0%
mean Cp       1       0       1       0       1       1       0       1       1       66.7%    0.0%
std of FA     1       0       1       0       1       0       0       0       1       44.4%    0.0%
std of Cp     1       0       1       0       1       0       0       0       1       44.4%    0.0%
std of Cs     1       0       1       0       1       0       0       0       1       44.4%    0.0%
std of MD     1       0       0       0       0       0       0       1       0       22.2%    0.0%
v_BM          1       0       0       0       0       0       0       1       0       22.2%    0.0%
v_leftBM      1       0       0       0       0       0       0       1       0       22.2%    2.6%
v_rightBM     1       0       0       0       0       0       0       1       0       22.2%    0.0%
sym_BM        0       0       0       0       0       0       1       1       0       22.2%    7.7%
mean MD       0       0       0       0       0       0       0       0       0       0.0%     5.1%

Note: 1 means the data is detected as an outlier; 0 means the data is not detected as an outlier. TP: true positive; FP: false positive.


Chapter 5 Outlier Detection on the Kwyjibo and Tomacco Datasets

This chapter studies the performance of the automatic segmentation methods implemented with different CATNAP versions on the Tomacco dataset and on a larger dataset, the Kwyjibo dataset, using box-whisker plots and supervised classification methods. The Kwyjibo dataset contains DTI and MPRAGE scans from 203 subjects: 49 healthy controls and 154 patients with different kinds of ataxia. It has the same type of data as the Tomacco dataset, since the DWIs of the two datasets were acquired using the same sequence on the same 3T MR scanners (Intera, Philips Medical Systems, Netherlands). Similarly to the procedure on the Tomacco dataset, we first manually categorized all the segmentation labels in the Kwyjibo dataset as successful segmentations or segmentation failures. We then processed Kwyjibo using CATNAP-v2 to estimate diffusion tensors and CPSeg to generate segmentation labels. We used CATNAP-v2 rather than CATNAP-v1 because CATNAP-v2 was used for processing the Kwyjibo data in Ye et al. (2015). Note that the Tomacco dataset was processed using CATNAP-v1, which differs slightly from CATNAP-v2 in the parameter settings of two registration modules and in the version of a skull-stripping module (called SPECTRE).

After the Kwyjibo dataset was processed, we found that some features, such as surface area and MD, are statistically different from those of the Tomacco data processed with CATNAP-v1. This is a problem, since we want to merge the two datasets for further study. To guarantee the validity of merging them, we reran the Tomacco data using CATNAP-v2 and CPSeg and verified the two CATNAP pipelines by quantitatively comparing the volumes of the segmented CPs of the Tomacco dataset obtained with each CATNAP version. The results show that the final segmentations using the two CATNAP pipelines are different. By then visually checking the segmentation results produced by the two pipelines, we came to the conclusion that CATNAP-v2 performs better than CATNAP-v1.

We next reassessed the performance of the automatic segmentation algorithm using box-whisker plots on the reprocessed Tomacco data (using CATNAP-v2 and CPSeg). We found that the reprocessed Tomacco dataset has fewer segmentation failures than the one processed using CATNAP-v1, which further supports the conclusion that CATNAP-v2 performs better.

Although the verification of the two CATNAP pipelines is not complete, since we did not compute Dice coefficients or ASDs, it at least provides some supporting evidence for the conclusion above. (Dice coefficients and ASDs were not compared because the two algorithms use different digital grid sizes, and these comparison metrics are therefore not directly computable.) Thus, it is sensible to use CATNAP-v2 for processing both the Tomacco and Kwyjibo data. In the next section, we merge the two datasets and apply supervised classification methods to outlier detection.

Last, we combined the reprocessed Tomacco dataset and the Kwyjibo dataset and used four supervised classifiers: linear discriminant analysis (LDA), logistic regression (LR), a support vector machine (SVM), and a random forest classifier (RFC). All categorized segmentation failures in the two datasets were used for training these classifiers. Their performances were validated using leave-one-out cross-validation.
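This setup can be sketched with scikit-learn. The thesis does not give implementation details, so the data here are synthetic stand-ins (60 "segmentations" with 26 features, matching the feature count of Chapter 4, with the failure class artificially made separable):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC

# Stand-in data: the first 8 samples are "failures", shifted so the
# toy problem is clearly separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 26))
y = np.zeros(60, dtype=int)
y[:8] = 1
X[y == 1] += 2.0

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RFC": RandomForestClassifier(n_estimators=50, random_state=0),
}
accuracies = {}
for name, clf in classifiers.items():
    # Each sample is predicted by a model trained on the other 59.
    pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    accuracies[name] = (pred == y).mean()
    print(f"{name}: leave-one-out accuracy = {accuracies[name]:.2f}")
```

Leave-one-out is a natural choice here because labeled segmentation failures are scarce: every failure gets used for training in all but one fold.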

5.1 Categorization of Automatic Segmentations

Kwyjibo is a dataset containing DTI and MPRAGE scans from 203 subjects: 49 healthy controls and 154 patients with different kinds of ataxia. Some patients were scanned more than once. This dataset can be processed in the same way as Tomacco, since the DWIs of the two datasets were acquired using the same sequence on the same 3T MR scanners (Intera, Philips Medical Systems, Netherlands). However, there are two key differences between the two datasets: 1) the population of subjects with ataxia is much larger in the Kwyjibo dataset than in the Tomacco dataset, and 2) the MPRAGE scans in the Tomacco dataset were registered to MNI space and have a 1 mm isotropic resolution, while the MPRAGE scans in the Kwyjibo dataset were not registered to MNI space and have a 0.828 mm isotropic resolution. Thus, the inputs (MPRAGE scans) to the two CATNAPs have different resolutions.

To evaluate the ability of the selected features to identify failed segmentations in the Kwyjibo dataset, we first manually categorized the segmentation results in the Kwyjibo dataset as segmentation failures or successful segmentations, in the same way as for Tomacco. Given our extensive experience with the Tomacco data, it took only around three days to categorize all 203 datasets in Kwyjibo. The category, numerical score for segmentation quality, diagnosis, and gender of the eight failures and two imperfect segmentations in the Kwyjibo data are presented in Table 5.1. Because this dataset is relatively large, we only list the segmentations with scores lower than 10.


The remaining datasets, not listed here, are given the full score of 10 as they are successful segmentations. We found eight segmentation failures and two imperfect but successful segmentations in the Kwyjibo dataset, as shown in Table 5.1. Images and descriptions of the eight segmentation failures are shown in Figure 5.1.


Table 5.1 Information of the categorized segmentation failures and imperfect but

successful segmentations in the Kwyjibo dataset including ID, diagnoses, gender,

categories, and scores.

Number  ID               Diagnosis            Gender  Category  Score
1       AT1219_20100730  ataxia               M       1         1
2       AT1556_20110802  SCA2                 F       1         1
3       AT1275_20070402  AT                   M       1         5
4       AT1061_20080804  MSA                  F       1         5
5       AT1061_20090306  MSA                  F       1         5
6       AT1532_20091202  sporadic ataxia      F       1         6
7       AT1569_20101001  Friedreich's ataxia  M       1         6
8       AT1594_20121129  SCA1                 M       1         6
9       AT1313_20080315  control              M       0         9
10      AT1315_20080324  control              M       0         9

Note: Category 1: segmentation failure; Category 0: successful segmentation


5.2 Outlier Detection Results

We applied the outlier detection method to the Kwyjibo dataset in the same way as before. Outliers detected by the features (the volumes and surface areas of the CPs; the means and standard deviations of the FA, MD, and three Westin indices; and the brain mask features) on the Kwyjibo dataset are shown in Figures 5.2, 5.3, 5.4, 5.5, and 5.6. The outliers detected in the Tomacco dataset by these features are also included in the boxplots for comparing the data distributions of the two datasets. The outlier detection results and the true positive and false positive rates of these features are summarized in Table 5.2.

In Table 5.2, we observe that some features which can robustly detect true segmentation failures in the Tomacco dataset do not work effectively on the Kwyjibo dataset. For example, except for the planar Westin index, all the remaining data quality oriented features failed to detect any true segmentation failure by outlier detection. Likewise, three of the brain mask features (the volumes of the right, left, and whole brain masks) also failed to detect true segmentation failures.

The inferior performance of these features on the Kwyjibo dataset was not expected. However, by looking into the segmentation failures and the imperfect but successful segmentations in both the Tomacco and Kwyjibo datasets, we believe that the poor performance of the features on the Kwyjibo dataset is explainable and reasonable. In total there are eight segmentation failures and two imperfect segmentations among the 203 Kwyjibo datasets, while there are nine segmentation failures and six imperfect segmentations among the 48 Tomacco datasets by visual inspection. The percentage of failures in the Tomacco dataset is 18.8%, while this ratio in the Kwyjibo dataset is only 3.9%. Likewise, the percentage of imperfect segmentations in the Tomacco dataset is 12.5%, while it is only 1.0% in the Kwyjibo dataset. Since the Tomacco and Kwyjibo datasets are inherently the same, having been acquired using the same sequences and MR scanners, we can conclude that the performance of the processing pipeline used on Kwyjibo is much better than that used on Tomacco. Since the two datasets were processed using the same segmentation pipeline (CPSeg) but different preprocessing pipelines (CATNAPs), we suspect that the CATNAP-v2 algorithm used on Kwyjibo performs better than the CATNAP-v1 algorithm used on Tomacco.

This hypothesis is consistent with our visual inspection of the segmentation results in the two datasets. Generally, we found more abnormal brain masks and more abnormal linear Westin indices in the Tomacco dataset than in the Kwyjibo dataset. Among the

nine failures in the Tomacco dataset, eight suffer from inferior quality of the linear Westin index. For example, as shown in Figure 4.2(j), the linear Westin index of the failure of at1103 is very abnormal, and the CP structures in it are incomplete. Similarly, the SCPs in the linear Westin indices of the failures of at1034 and at1049, shown in Figures 4.2(b) and (c), are too dim to be recognized easily. In contrast, among the eight failures in the Kwyjibo dataset, only three, shown in Figures 5.1(e), (f), and (h), suffer from inferior data quality. Thus, the higher correlation between segmentation failures and inferior diffusion tensor parameters in the Tomacco dataset can explain why the data quality features work better on Tomacco than on Kwyjibo.

We also found in Figures 5.2, 5.3, and 5.4 that the medians of surface areas of the

58

CPs, the means and standard deviations of the MD, and the means and standard

deviations of the three Westin indices are statistically different between the Tomacco and

Kwyjibo datasets with a significance level 0.05. The medians of the other features

between the two datasets are statistically the same at the same significance level. This is not what we expected, since we had assumed that the features calculated on the two datasets would be comparable. Because the categories and populations of the neurological diagnoses influencing the cerebellar peduncles differ between the two datasets, the differences in the MD and Westin index features may be reasonable. However, the significant differences in the surface areas of the CPs between the two datasets cannot be explained, since the corresponding volumes of the CPs are statistically the same. As mentioned before, we used CATNAP-v2 on Kwyjibo because it was the version used in Ye et al. (2015). Thus, given the preceding observations, to guarantee the validity of merging Tomacco and Kwyjibo, we should reprocess the Tomacco dataset using CATNAP-v2.

Before the CATNAP-v2 algorithm can be adopted formally for the Tomacco dataset, we must verify it. In the next section, we verify CATNAP-v2 quantitatively and qualitatively.


Table 5.2 Outlier detection results by selected features on the Kwyjibo dataset. The top eight subjects were manually categorized as segmentation failures in this dataset.

Feature       AT1219  AT1556  AT1275  AT1061_2008  AT1061_2009  AT1532  AT1569  AT1594  TP rate*  FP rate*
surface area    1       1       1         1            1          1       1       1     100.0%     6.2%
volume          1       1       1         1            1          1       1       1     100.0%     5.1%
std of Cp       1       0       0         0            0          0       0       0      12.5%     2.6%
sym_BM          0       0       1         0            0          0       0       0      12.5%     2.1%
mean Cp         1       0       0         0            0          0       0       0      12.5%     0.5%
mean Cl         0       0       0         0            0          0       0       0       0.0%     1.0%
std of Cl       0       0       0         0            0          0       0       0       0.0%     1.0%
std of Cs       0       0       0         0            0          0       0       0       0.0%     1.0%
mean FA         0       0       0         0            0          0       0       0       0.0%     0.5%
mean Cs         0       0       0         0            0          0       0       0       0.0%     0.5%
std of FA       0       0       0         0            0          0       0       0       0.0%     0.5%
std of MD       0       0       0         0            0          0       0       0       0.0%     0.5%
v_rightBM       0       0       0         0            0          0       0       0       0.0%     0.5%
v_BM            0       0       0         0            0          0       0       0       0.0%     0.0%
v_leftBM        0       0       0         0            0          0       0       0       0.0%     0.0%
mean MD         0       0       0         0            0          0       0       0       0.0%     0.0%

Note: 1 means the data is detected as an outlier; 0 means it is not. *TP: true positive; FP: false positive.


5.3 Verification of the CATNAP-v2 Algorithm

The CATNAP-v1 and the CATNAP-v2 algorithms have three main differences.

First, the parameters of the registration processes are set differently. In CATNAP-v1, the degrees of freedom for one registration module (Optimized Automated Registration) is set to "Affine-12", and for the other registration module (File Collection Efficient Registration) it is set to "Rigid-6". In CATNAP-v2, the degrees of freedom are "Rigid-6" for the former and "Affine-12" for the latter. Second, the inputs to the former registration module (Optimized Automated Registration) differ between the two pipelines. In CATNAP-v1, the two inputs to this module are an unstripped volume and a mean B0 volume, whereas in CATNAP-v2 they are a volume stripped using a skull-stripping module (SPECTRE) and a mean B0 volume. Third, the versions of SPECTRE differ: CATNAP-v1 uses SPECTRE 2009 while CATNAP-v2 uses

SPECTRE 2010. Visual inspection shows that SPECTRE 2010 in CATNAP-v2, used for Kwyjibo, generated almost no abnormal brain masks, whereas SPECTRE 2009 in CATNAP-v1 generated several very abnormal brain masks in the Tomacco dataset. This is evidence of the superiority of CATNAP-v2 over CATNAP-v1.

Next, we reran the Tomacco dataset using CATNAP-v2 together with CPSeg and compared the volumes of the 30 segmentations that were successful under both CATNAP-v1 and CATNAP-v2. The boxplots and histograms of the volume differences of the six CPs of the 30 Tomacco segmentations using the two CATNAPs are shown in Figures 5.7 and 5.8. Boxplots in Figure 5.7 show that the volumes of the rSCP and MCP using CATNAP-v2 are larger than those using CATNAP-v1, and the differences


are statistically significant, since the notches do not overlap zero. The volumes of the other four CPs using the two CATNAPs are statistically the same.
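The notch criterion used above follows the usual notched-boxplot convention (McGill, Tukey & Larsen, 1978): an approximate 95% interval around the median of half-width 1.57 × IQR / √n, so a notch on the paired volume differences that excludes zero indicates a significant median shift at roughly the 5% level. A minimal sketch, with purely hypothetical difference values:

```python
import numpy as np

def notch_interval(x):
    """95% notch around the median, as drawn by notched boxplots
    (McGill, Tukey & Larsen, 1978): median +/- 1.57 * IQR / sqrt(n)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    half = 1.57 * (q3 - q1) / np.sqrt(len(x))
    return med - half, med + half

# Hypothetical paired MCP volume differences (CATNAP-v2 minus CATNAP-v1)
diff = [0.31, 0.28, 0.35, 0.22, 0.41, 0.30, 0.27, 0.33, 0.25, 0.38, 0.29, 0.36]
lo, hi = notch_interval(diff)
print(lo > 0)  # True: the notch excludes zero, so the median shift is significant
```

This is the same visual test as checking whether two boxplot notches overlap, applied here to the distribution of per-subject differences.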

Next we visually checked the segmentation results using the two CATNAPs. We first resampled the Tomacco segmentations (from CATNAP-v1) from the original 1 mm isotropic resolution to a 0.828 mm isotropic grid. We then registered the resampled segmentations to the corresponding CATNAP-v2 segmentations, which have a 0.828 mm isotropic resolution, subtracted the registered segmentations from the CATNAP-v2 segmentations, and overlaid the difference image onto the corresponding MPRAGE images. We found that the MCPs in the CATNAP-v2 segmentation results are more similar to the MCPs in the MPRAGE images than some of those obtained using CATNAP-v1, as shown in Figure 5.9.

In summary, the segmentation results using the two CATNAPs are statistically different. There were fewer abnormal brain masks in the Kwyjibo data using CATNAP-v2, and the segmented MCPs in the reprocessed Tomacco data were more accurate using CATNAP-v2 than using CATNAP-v1. Thus, we believe that CATNAP-v2 performs better than CATNAP-v1, and we decided to continue using CATNAP-v2.


5.4 Reproduction of the Tomacco Dataset

In order to merge the Tomacco and Kwyjibo datasets, we reran the Tomacco dataset using CATNAP-v2 and CPSeg (exactly the same pipelines used on the Kwyjibo dataset). We then studied the performance of the automatic segmentation algorithm on the reprocessed Tomacco dataset in the same way as we did on the Kwyjibo dataset.

First, we categorized the segmentations from the reprocessed Tomacco data as segmentation failures or successful segmentations. Because of quality issues with their MPRAGE images, we excluded two subjects (at1081 and at1083) from the 48 Tomacco datasets; therefore, we reprocessed only 46 Tomacco datasets in this study. As shown in Table 5.3, we found a total of four segmentation failures among the 46 Tomacco datasets. Since they were previously shown in Figures 4.2(b), (c), (i), and (j) in Chapter 4, we do not show images of them here.


Table 5.3 Information of the 46 subjects in the Tomacco dataset including ID, diagnosis, gender, category, and score.

No.  ID      Diagnosis      Gender  Category  Score    No.  ID      Diagnosis      Gender  Category  Score
1    at1103  control        M       1         5        24   at1020  control        F       0         10
2    at1034  SCA6           F       1         5        25   at1021  control        M       0         10
3    at1049  SCA6           F       1         5        26   at1022  cb             F       0         10
4    at1078  cb+            F       1         6        27   at1023  cb             M       0         10
5    at1002  cb             M       0         10       28   at1024  nph            M       0         10
6    at1016  cb             M       0         10       29   at1025  cb             F       0         10
7    at1046  cb             M       0         10       30   at1026  control        M       0         10
8    at1007  cb             F       0         10       31   at1027  ?cb            M       0         10
9    at1032  control        F       0         10       32   at1028  cb+            M       0         10
10   at1041  cb+            M       0         10       33   at1029  control        F       0         10
11   at1056  ?cb            F       0         10       34   at1031  control        F       0         10
12   at1080  control        F       0         10       35   at1033  sca6           F       0         10
13   at1086  control        M       0         10       36   at1036  cb+            F       0         10
14   at1000  sca6           M       0         10       37   at1038  vest           M       0         10
15   at1003  cb             M       0         10       38   at1040  fam17/control  M       0         10
16   at1005  fam17/cb       F       0         10       39   at1043  cb             M       0         10
17   at1006  cb             M       0         10       40   at1044  control        F       0         10
18   at1011  cb             F       0         10       41   at1045  control        M       0         10
19   at1013  ?cb            M       0         10       42   at1048  sca6           F       0         10
20   at1014  cb             M       0         10       43   at1060  cb             M       0         10
21   at1015  cb             M       0         10       44   at1079  control        F       0         10
22   at1017  sca6           M       0         10       45   at1082  control        F       0         10
23   at1018  control        F       0         10       46   at1084  control        F       0         10

Note: Category 1: segmentation failure; Category 0: successful segmentation.


Outliers were detected in the 46 Tomacco datasets using boxplots. The boxplots with detected outliers are shown in Figures 5.10, 5.11, 5.12, 5.13, and 5.14, and the outlier detection results are shown in Table 5.4. Consistent with the feature performance results on the 48 Tomacco datasets using CATNAP-v1, the object features and the data quality features work well on the 46 reprocessed Tomacco datasets, except for the mean and standard deviation of the MD. The brain mask features perform worse than the other two kinds of features, which is also consistent with the previous results using CATNAP-v1 on the 48 Tomacco datasets.
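The boxplot outlier rule applied throughout these detection experiments is Tukey's 1.5 × IQR criterion: any value beyond the whiskers is flagged. A minimal sketch, with hypothetical volume values:

```python
import numpy as np

def boxplot_outliers(values, k=1.5):
    """Indices of values beyond k * IQR from the quartiles (Tukey's rule),
    i.e., the points plotted outside the boxplot whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

# Hypothetical CP volumes (cm^3): one subject is far below the rest
volumes = [5.1, 5.3, 4.9, 5.0, 5.2, 5.1, 2.0]
print(boxplot_outliers(volumes))  # -> [6]: the last subject is flagged
```

Applying this rule independently to each feature column yields the per-feature true and false positive rates reported in Tables 5.2 and 5.4.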

Since the Tomacco datasets were reprocessed using the same pipeline as Kwyjibo, we can merge them for a new outlier detection study. With the manual categorization of all segmentation results on both datasets (as either successful or failed), several supervised classification methods are used to develop an automatic method for detecting failed segmentations. Details are described in the next section.


Table 5.4 Outlier detection results by selected features of the reprocessed Tomacco dataset. Four subjects were manually categorized as segmentation failures in this dataset.

Feature       at1034  at1049  at1078  at1103  TP rate*  FP rate*
volume          1       1       1       1     100.0%     4.8%
surface area    1       1       1       1     100.0%    14.3%
mean FA         1       1       0       1      75.0%     0.0%
std of FA       1       1       0       1      75.0%     0.0%
mean Cl         1       1       0       1      75.0%     2.4%
std of Cl       1       1       0       1      75.0%     0.0%
mean Cs         1       1       0       1      75.0%     0.0%
mean MD         0       0       0       1      25.0%     4.8%
mean Cp         0       1       0       0      25.0%     0.0%
std of Cp       0       0       0       1      25.0%     0.0%
std of Cs       0       0       0       1      25.0%     0.0%
sym_BM          0       0       1       0      25.0%     2.4%
std of MD       0       0       0       0       0.0%     2.4%
v_BM            0       0       0       0       0.0%     2.4%
v_leftBM        0       0       0       0       0.0%     2.4%
v_rightBM       0       0       0       0       0.0%     2.4%

Note: 1 means the data is detected as an outlier; 0 means it is not. *TP: true positive; FP: false positive.


5.5 Outlier Detection using Classification Methods

In this section, we use the supervised learning methods described in Chapter 2 to develop classifiers for the detection of failed segmentations. With the manual categorization (into success or failure) of the 46 Tomacco datasets (yielding four segmentation failures and 42 successful segmentations) and the 203 Kwyjibo datasets (yielding eight segmentation failures and 195 successful segmentations), we are able to train classifiers for automatic failure detection. Considering the limited number of segmentation failures in both datasets, we decided to combine them and use all 12 failures for training. We trained four binary classifiers: linear discriminant analysis (LDA), logistic regression (LR), support vector machine (SVM), and random forest classifier (RFC). The performance of the four classifiers was evaluated using leave-one-out cross-validation. Details are presented in the next sections.

5.5.1 The four classifiers

LDA is a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that separates two or more classes. Logistic regression is similar to LDA; we used it for comparison with LDA. SVM is a supervised learning model used in machine learning for classification and regression analysis. It can generate classifiers from poorly balanced data, which is often the case in medical domains where abnormal data are rare or difficult to obtain. In an SVM, the input data are projected to a higher dimensional space by a kernel function to find a hyperplane that separates normal data from outliers. The kernel can be a linear dot


product, a polynomial function, a radial basis function, or a sigmoid function. We adopted the linear dot product kernel since it performed best among the four kernels in our trial experiments.

The random forest classifier, an ensemble classification method, is a collection of decision trees that usually performs better than a single decision tree. It was introduced in detail in Chapter 2, so no additional description is provided here. We used 100 trees in implementing this classifier and averaged the misclassification rate, true positive rate, and false positive rate over 20 runs.
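To make the classification setup concrete, here is a minimal pure-NumPy sketch of one of the four classifiers, logistic regression trained by batch gradient descent. The thesis's actual implementation and toolbox are not specified, and the feature values and labels below are toy stand-ins, not data from either dataset.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=500):
    """Logistic regression by batch gradient descent; returns weights
    with the bias as the last component."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # predicted failure probability
        w -= lr * Xb.T @ (p - y) / len(y)       # mean gradient of the log-loss
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(int)

# Toy z-scored features: "failures" (label 1) have unusually small
# volume and surface area, analogous to the object features in the text
X = np.array([[-2.0, -1.8], [-1.9, -2.1], [0.1, 0.2], [0.3, -0.1], [0.2, 0.4]])
y = np.array([1, 1, 0, 0, 0])
w = train_logistic(X, y)
print(predict(w, X))  # -> [1 1 0 0 0] on this separable toy set
```

The other three classifiers share this interface: each maps the 26-dimensional feature vector to a binary success/failure label.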

5.5.2 Training sets

To train the classifiers, we must select a representative training set. Choosing a proper training set is critical since it directly affects the generalization and classification accuracy of the classifiers. In our datasets, we have only four segmentation failures in the Tomacco dataset and eight in the Kwyjibo dataset. To provide the most information about failures during training, we used all 12 segmentation failures in our training sets. Since failure detection is a binary classification task, we must also include successful segmentations in the training set; there are a total of 237 successful segmentations in the two datasets. We built two training sets to evaluate. The first consists of the 12 segmentation failures and all 237 successful segmentations. The second consists of the 12 segmentation failures and only 24 successful segmentations randomly selected from the 237.

The features we used are the same as before: the object features, the data quality features, and the brain mask features. The feature vector is F = [V, S, FA, MD, WI, BM], where V and S are the object features (volumes and surface areas of the six CPs); FA, MD, and WI are the data quality features (means and standard deviations of the FA, the MD, and the three Westin indices, respectively); and BM are the brain mask features (the volumes of the left, right, and whole brain masks as well as the symmetry of the brain masks). A total of 26 numerical features are used. Since the ranges of the volumes and surface areas are much larger than those of the data quality features, we normalized all feature values before training the classifiers. Each feature was normalized by subtracting the mean of its values from each data point and then dividing by the standard deviation of its values.
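This per-feature normalization is the standard z-score. A minimal sketch, with hypothetical values for two features of very different scale:

```python
import numpy as np

def zscore(features):
    """Normalize each feature column to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)

# Columns with very different ranges, e.g., a volume in mm^3 and a mean FA
F = np.array([[5200.0, 0.45],
              [4800.0, 0.50],
              [5000.0, 0.40]])
Z = zscore(F)
print(Z.mean(axis=0), Z.std(axis=0))  # each column: mean ~0, std ~1
```

Without this step, the large-magnitude object features would dominate distance-based decisions in the classifiers.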

5.5.3 Performance evaluation

We evaluated the four classifiers using leave-one-out cross-validation on each of the two training sets. When the segmentation result of one subject in a training set was tested, the remaining subjects in that set were used in the training phase (which means, for example, that when a failure is being evaluated, only 11 failures are used for training). The misclassification rate (MCR), true positive (TP) rate, and false positive (FP) rate, as well as the numbers of true and false positives on both training sets, are shown in Table 5.5.
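The leave-one-out protocol and the reported metrics can be sketched as follows. The nearest-centroid classifier and the toy data here are illustrative placeholders, not one of the four classifiers or the actual feature data evaluated in the text.

```python
import numpy as np

def loo_evaluate(X, y, train_and_predict):
    """Leave-one-out CV returning (MCR, TP rate, FP rate), where a
    'positive' is a segmentation failure (label 1)."""
    preds = np.empty(len(y), dtype=int)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i                  # hold out subject i
        preds[i] = train_and_predict(X[mask], y[mask], X[i])
    tp = np.sum((preds == 1) & (y == 1))
    fp = np.sum((preds == 1) & (y == 0))
    return float(np.mean(preds != y)), tp / np.sum(y == 1), fp / np.sum(y == 0)

def nearest_centroid(Xtr, ytr, x):
    """Placeholder classifier: assign the class with the nearer centroid."""
    d0 = np.linalg.norm(x - Xtr[ytr == 0].mean(axis=0))
    d1 = np.linalg.norm(x - Xtr[ytr == 1].mean(axis=0))
    return int(d1 < d0)

X = np.array([[-2.0], [-1.8], [0.2], [0.1], [0.3], [0.0]])  # toy 1-D feature
y = np.array([1, 1, 0, 0, 0, 0])
mcr, tp_rate, fp_rate = loo_evaluate(X, y, nearest_centroid)
print(mcr, tp_rate, fp_rate)
```

Swapping `nearest_centroid` for any of the four trained classifiers yields the MCR, TP, and FP values of the kind reported in Table 5.5.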

Table 5.5 shows that on the first (larger) training set, the RFC performs best and the LR performs worst among the four classifiers; the performances of the LDA and the linear SVM are very similar. On the second training set, the LDA, the LR, and the linear SVM perform similarly, and the RFC's performance is slightly worse than the other three. Considering that the second training set contains only 24 randomly selected successful segmentations, the inferior performance of the RFC on this set is not surprising. It is hard to conclude which classifier performs best. Taking both training sets together, the total numbers of true positives and false positives are 19 and 4 using LDA, 18 and 7 using LR, 17 and 2 using SVM, and 19 and 3 using RFC, respectively. In summary, the LR performs worst among the four classifiers and the other three perform comparably.


Table 5.5 Performance comparison of the four classifiers (LDA, LR, SVM, and RFC) on the combined Tomacco and Kwyjibo dataset.

Training set 1
              MCR            # TP  # FP  TP rate        FP rate
LDA           0.028          8     3     0.667          0.013
LR            0.044          7     6     0.583          0.025
SVM (linear)  0.028          7     2     0.583          0.008
RFC*          0.021 (0.002)  9     2     0.75 (0)       0.009 (0.002)

Training set 2
              MCR            # TP  # FP  TP rate        FP rate
LDA           0.056          11    1     0.917          0.042
LR            0.056          11    1     0.917          0.042
SVM (linear)  0.056          10    0     0.833          0
RFC*          0.075 (0.013)  10    1     0.863 (0.041)  0.044 (0.009)

* 100 trees, mtry = 4, averaged over 20 runs


Chapter 6 Conclusions and Future Work

In this thesis, we presented an approach to quality assurance using outlier detection for the automatic cerebellar peduncle segmentation algorithm presented by Ye et al. (2015). In this chapter, we summarize the main contributions of this thesis and suggest future improvements.

6.1 Main Contributions

Three main contributions are made in this thesis. First, we validated a new cerebellar peduncle segmentation pipeline (CPSeg) against the corresponding old pipeline (RFC+MGDM), which was used in the paper reporting the peduncle segmentation algorithm. Dice coefficients (Dice, 1945) and average surface distances (ASDs) between nine segmentation results and the corresponding manual delineations were computed for both pipelines. A paired Student's t-test and a Wilcoxon signed-rank test (significance level 0.05) were performed on the Dice coefficients and the ASDs, respectively. The test results show that the performances of CPSeg and RFC+MGDM are not statistically different. Furthermore, the results of the two tests on the Dice coefficients indicate that CPSeg segments the dSCP better.
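The paired Student's t-test used here is computed from the per-subject differences of the matched quality scores. A minimal sketch of the statistic, with hypothetical Dice coefficients for the nine subjects (not the thesis's actual values):

```python
import numpy as np

def paired_t(a, b):
    """Paired Student's t statistic and degrees of freedom for two
    matched samples (here, per-subject Dice coefficients)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = len(d)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n)), n - 1

# Hypothetical Dice coefficients for nine subjects under the two pipelines
dice_cpseg = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.77, 0.82]
dice_old   = [0.79, 0.83, 0.77, 0.80, 0.80, 0.82, 0.81, 0.78, 0.81]
t, df = paired_t(dice_cpseg, dice_old)
print(t, df)  # |t| well below the two-sided 5% critical value 2.306 at df = 8
```

The Wilcoxon signed-rank test plays the same role for the ASDs but ranks the absolute differences rather than assuming normality.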

Second, we validated a preprocessing pipeline, CATNAP, against a slightly different version of this pipeline; we call the old version CATNAP-v1 and the new version CATNAP-v2. This validation was necessary since the Tomacco and Kwyjibo datasets were processed using different CATNAP versions, but we wanted to combine them for outlier detection using supervised classification. We conducted both quantitative comparison and visual inspection of the segmentation results in the Tomacco dataset using the two CATNAPs as the preprocessing step. The results show that CATNAP-v2 generates statistically different volumes of the MCPs compared with CATNAP-v1, and visual inspection of the segmentation results shows that CATNAP-v2 performs slightly better at segmenting the MCPs.

Third, we studied the performance of the automatic segmentation algorithm using box-whisker plots and four supervised classifiers for failure detection. We first manually categorized each segmentation result in the Tomacco dataset as a segmentation failure or a successful segmentation. We then designed three kinds of features of the image data, informed by the categorized failures in the Tomacco dataset, for detecting potential segmentation failures automatically. With these features, we detected outliers using boxplots and evaluated each feature's performance. Next, we did a similar outlier detection study using box-whisker plots on the Kwyjibo dataset with 203 subjects; this dataset has only eight segmentation failures and two imperfect but successful segmentations. Then we reprocessed the Tomacco dataset using the same algorithm pipeline as used for Kwyjibo and merged the two datasets. With the manually categorized segmentation results as training data, we detected failures automatically using four classifiers: linear discriminant analysis (LDA), logistic regression (LR), support vector machine (SVM), and random forest classifier (RFC). We evaluated the performance of each classifier using leave-one-out cross-validation and computed the true positives and false positives of each classifier. Our results show that the performances of the LDA, the linear SVM, and the RFC are not very different, and the LR performs worse than the other three classifiers.

6.2 Future Work

In sum, this project is best described as an exploration of quality assurance for the performance of the automatic segmentation algorithm using outlier detection. For example, before looking into the segmentation failures, we did not have a clear understanding of their quantity, their appearance, or the reasons why they became failures; thus we did not plan to use classification methods for outlier detection until we had enough information about the failures in the two datasets. The work in this project is therefore only a small starting point on quality assurance using outlier detection for the proposed segmentation algorithm. Many improvements can be made in the future.

First, more effective features should be extracted to improve classification accuracy and to distinguish outlier failures from naturally occurring deviations. Features for registration accuracy and correctness and for the quality of the PEVs in the CPSeg pipeline can be considered in the future. Furthermore, the data quality features, such as the means and standard deviations of the FA, the MD, and the three Westin indices of the whole brain, do not perform as well as object features like the volumes and surface areas of the CPs; how to modify these features and make them more effective remains to be studied. As well, although the volumes and surface areas of the CPs can effectively detect true segmentation failures, their false positive rates are relatively higher than those of other features, since there are naturally occurring subjects with extreme volumes. Boxplots and the supervised classifiers cannot necessarily distinguish extreme data or anatomy from algorithm failures. If alternative features can be developed to replace the two object features, this problem may be solved.

Second, since there are only 12 segmentation failures in total in the combined Tomacco and Kwyjibo dataset, the generalization of the four classifiers trained on the two datasets needs further study. We should therefore apply these classifiers to other, separate datasets with other diagnoses to evaluate their accuracy.

Lastly, the general approach that we have pursued here can be applied to other medical image segmentation algorithms. While our application is very specific (the segmentation of the cerebellar peduncles), there are numerous automatic segmentation algorithms used on medical imaging data in neuroscience and in many other fields of study that would benefit from automatic quality assurance. Our approach suggests an overall methodology that could be adapted and used in many other applications.


Bibliography

Asman, A. J., Lauzon, C. B., & Landman, B. A. (2013). Robust Inter-Modality Multi-Atlas Segmentation for PACS-based DTI Quality Control. Proc SPIE Int Soc Opt Eng, 8674. doi: 10.1117/12.2007587

Avants, B. B., Epstein, C. L., Grossman, M., & Gee, J. C. (2008). Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis, 12(1), 26-41.

Bazin, P.-L., Ye, C., Bogovic, J. A., Shiee, N., Reich, D. S., Prince, J. L., & Pham, D. L. (2011). Direct segmentation of the major white matter tracts in diffusion tensor images. NeuroImage, 58(2), 458-468. doi: 10.1016/j.neuroimage.2011.06.020

Bogovic, J. A., Prince, J. L., & Bazin, P.-L. (2013). A Multiple Object Geometric Deformable Model for Image Segmentation. Comput Vis Image Underst, 117(2), 145-157. doi: 10.1016/j.cviu.2012.10.006

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

De Maesschalck, R., Jouan-Rimbaud, D., & Massart, D. L. (2000). The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1), 1-18.

Dice, L. R. (1945). Measures of the Amount of Ecologic Association Between Species. Ecology, 26(3), 297-302. doi: 10.2307/1932409

Dreiseitl, S., Osl, M., Scheibbock, C., & Binder, M. (2010). Outlier Detection with One-Class SVMs: An Application to Melanoma Prognosis. AMIA Annu Symp Proc, 2010, 172-176.

Gallichan, D., Scholz, J., Bartsch, A., Behrens, T. E., Robson, M. D., & Miller, K. L. (2010). Addressing a systematic vibration artifact in diffusion-weighted MRI. Hum Brain Mapp, 31(2), 193-202. doi: 10.1002/hbm.20856

Grubbs, F. E. (1969). Procedures for Detecting Outlying Observations in Samples. Technometrics, 11(1), 1-21. doi: 10.2307/1266761

Hao, X., Zygmunt, K., Whitaker, R. T., & Fletcher, P. T. (2014). Improved segmentation of white matter tracts with adaptive Riemannian metrics. Med Image Anal, 18(1), 161-175. doi: 10.1016/j.media.2013.10.007

Hodge, V., & Austin, J. (2004). A Survey of Outlier Detection Methodologies. Artificial Intelligence Review, 22(2), 85-126. doi: 10.1007/s10462-004-4304-y

Ihalainen, T., Sipila, O., & Savolainen, S. (2004). MRI quality control: six imagers studied using eleven unified image quality parameters. Eur Radiol, 14(10), 1859-1865. doi: 10.1007/s00330-004-2278-4

King, A. D., Walshe, J. M., Kendall, B. E., Chinn, R. J., Paley, M. N., Wilkinson, I. D., . . . Hall-Craggs, M. A. (1996). Cranial MR imaging in Wilson's disease. American Journal of Roentgenology, 167(6), 1579-1584. doi: 10.2214/ajr.167.6.8956601

Knutsson, H. (1985). Producing a continuous and distance preserving 5-D vector representation of 3-D orientation.

Landman, B. A., Farrell, J. A., Patel, N., Mori, S., & Prince, J. L. (2007). DTI fiber tracking: the importance of adjusting DTI gradient tables for motion correction. CATNAP - a tool to simplify and accelerate DTI analysis. Paper presented at the Proc. Org. Human Brain Mapping 13th Annual Meeting.

Laurikkala, J., Juhola, M., Kentala, E., Lavrac, N., Miksch, S., & Kavsek, B. (2000). Informal identification of outliers in medical data. Paper presented at the Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology.

Lauzon, C. B., Asman, A. J., Esparza, M. L., Burns, S. S., Fan, Q., Gao, Y., . . . Landman, B. A. (2013). Simultaneous analysis and quality assurance for diffusion tensor imaging. PLoS One, 8(4), e61737. doi: 10.1371/journal.pone.0061737

Lawes, I. N., Barrick, T. R., Murugam, V., Spierings, N., Evans, D. R., Song, M., & Clark, C. A. (2008). Atlas-based segmentation of white matter tracts of the human brain using diffusion tensor tractography and comparison with classical dissection. NeuroImage, 39(1), 62-79. doi: 10.1016/j.neuroimage.2007.06.041

Le Bihan, D., Mangin, J.-F., Poupon, C., Clark, C. A., Pappata, S., Molko, N., & Chabriat, H. (2001). Diffusion tensor imaging: Concepts and applications. Journal of Magnetic Resonance Imaging, 13(4), 534-546. doi: 10.1002/jmri.1076

Lucas, B. C., Bogovic, J. A., Carass, A., Bazin, P.-L., Prince, J. L., Pham, D. L., & Landman, B. A. (2010). The Java Image Science Toolkit (JIST) for rapid prototyping and publishing of neuroimaging software. Neuroinformatics, 8(1), 5-17.

Magalhaes, A. C. A., Caramelli, P., Menezes, J. R., Lo, L. S., Bacheschi, L. A., Barbosa, E. R., . . . Magalhaes, A. (1994). Wilson's disease: MRI with clinical correlation. Neuroradiology, 36(2), 97-100. doi: 10.1007/BF00588068

Mai, S. T., Goebl, S., & Plant, C. (2012). A Similarity Model and Segmentation Algorithm for White Matter Fiber Tracts. 12th IEEE International Conference on Data Mining (ICDM 2012), 1014-1019. doi: 10.1109/Icdm.2012.95

Mayer, A., Zimmerman-Moreno, G., Shadmi, R., Batikoff, A., & Greenspan, H. (2011). A supervised framework for the registration and segmentation of white matter fiber tracts. IEEE Trans Med Imaging, 30(1), 131-145. doi: 10.1109/TMI.2010.2067222

Mori, S., Wakana, S., Van Zijl, P. C., & Nagae-Poetscher, L. (2005). MRI atlas of human white matter (Vol. 16): Am Soc Neuroradiology.

Murata, Y., Kawakami, H., Yamaguchi, S., Nishimura, M., Kohriyama, T., Ishizaki, F., . . . Nakamura, S. (1998). Characteristic magnetic resonance imaging findings in spinocerebellar ataxia 6. Archives of Neurology, 55(10), 1348-1352.

Nicoletti, G., Fera, F., Condino, F., Auteri, W., Gallo, O., Pugliese, P., . . . Zappia, M. (2006). MR imaging of middle cerebellar peduncle width: differentiation of multiple system atrophy from Parkinson disease. Radiology, 239(3), 825-830.

Perrini, P., Tiezzi, G., Castagna, M., & Vannozzi, R. (2013). Three-dimensional microsurgical anatomy of cerebellar peduncles. Neurosurg Rev, 36(2), 215-224; discussion 224-225. doi: 10.1007/s10143-012-0417-y

Roberts, S., & Tarassenko, L. (1994). A probabilistic resource allocating network for novelty detection. Neural Computation, 6(2), 270-284.

Rodrigues, G., Louie, A., Videtic, G., Best, L., Patil, N., Hallock, A., . . . Bauman, G. (2012). Categorizing segmentation quality using a quantitative quality assurance algorithm. J Med Imaging Radiat Oncol, 56(6), 668-678. doi: 10.1111/j.1754-9485.2012.02442.x

Saenz, D., Kim, H., Chen, J., Stathakis, S., & Kirby, N. (2015). SU-E-J-97: Quality Assurance of Deformable Image Registration Algorithms: How Realistic Should Phantoms Be? Med Phys, 42(6), 3286. doi: 10.1118/1.4924184

Sharpe, M., & Brock, K. K. (2008). Quality assurance of serial 3D image registration, fusion, and segmentation. International Journal of Radiation Oncology, Biology, Physics, 71(1), S33-S37.

Sivaswamy, L., Kumar, A., Rajan, D., Behen, M., Muzik, O., Chugani, D., & Chugani, H. (2010). A diffusion tensor imaging study of the cerebellar pathways in children with autism spectrum disorder. J Child Neurol, 25(10), 1223-1231. doi: 10.1177/0883073809358765

Tax, D., Ypma, A., & Duin, R. (1999). Support vector data description applied to machine vibration analysis. Paper presented at the Proc. 5th Annual Conference of the Advanced School for Computing and Imaging (Heijen, NL).

Wang, F., Sun, Z., Du, X., Wang, X., Cong, Z., Zhang, H., . . . Hong, N. (2003). A diffusion tensor imaging study of middle and superior cerebellar peduncle in male patients with schizophrenia. Neuroscience Letters, 348(3), 135-138.

Wang, Z. J., Seo, Y., Chia, J. M., & Rollins, N. K. (2011). A quality assurance protocol for diffusion tensor imaging using the head phantom from American College of Radiology. Med Phys, 38(7), 4415-4421.

Westin, C.-F., Peled, S., Gudbjartsson, H., Kikinis, R., & Jolesz, F. A. (1997). Geometrical diffusion measures for MRI from tensor basis analysis. Paper presented at the Proceedings of ISMRM.

Ye, C., Bazin, P.-L., Bogovic, J. A., Ying, S. H., & Prince, J. L. (2012). Labeling of the cerebellar peduncles using a supervised Gaussian classifier with volumetric tract segmentation. Paper presented at SPIE Medical Imaging.

Ye, C., Bogovic, J. A., Ying, S. H., & Prince, J. L. (2013). Segmentation of the Complete Superior Cerebellar Peduncles Using a Multi-Object Geometric Deformable Model. Proc IEEE Int Symp Biomed Imaging, 2013, 49-52. doi: 10.1109/ISBI.2013.6556409

Ye, C., Yang, Z., Ying, S., & Prince, J. (2015). Segmentation of the Cerebellar Peduncles Using a Random Forest Classifier and a Multi-object Geometric Deformable Model: Application to Spinocerebellar Ataxia Type 6. Neuroinformatics, 13(3), 367-381. doi: 10.1007/s12021-015-9264-7

Ying, S., Landman, B., Chowdhury, S., Sinofsky, A., Gambini, A., Mori, S., . . . Prince, J. (2009). Orthogonal diffusion-weighted MRI measures distinguish region-specific degeneration in cerebellar ataxia subtypes. Journal of Neurology, 256(11), 1939-1942. doi: 10.1007/s00415-009-5269-1

Yung, J., Stefan, W., Reeve, D., & Stafford, R. J. (2015). TU-F-CAMPUS-I-05: Semi-Automated, Open Source MRI Quality Assurance and Quality Control Program for Multi-Unit Institution. Med Phys, 42(6), 3647. doi: 10.1118/1.4925830

Zhang, S., Correia, S., & Laidlaw, D. H. (2008). Identifying white-matter fiber bundles in DTI data using an automated proximity-based fiber-clustering method. IEEE Transactions on Visualization and Computer Graphics, 14(5), 1044-1053.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. Paper presented at the ACM SIGMOD Record.


Vita

Ke Li was born on May 16, 1990 in Yulin, Shaanxi Province, China. She received

her Bachelor of Engineering degree in Optical Engineering with a speciality in

Information Engineering at Zhejiang University, China in July 2013. She then began her

studies toward her Master of Science in Engineering (M.S.E.) degree in Biomedical

Engineering at Johns Hopkins University in August 2013. She conducted her research in

the Image Analysis and Communications Lab under the direction of Dr. Jerry L. Prince.