
[IEEE 2013 12th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, December 4-7, 2013]

Weak Segmentations and Ensemble Learning to Predict Semantic Ratings of Lung Nodules

Ethan Smith
Department of Information and Computer Sciences

University of Hawaii at Manoa

Honolulu, USA

Email: [email protected]

Patrick Stein, Jacob Furst, & Daniela Stan Raicu
College of Computing and Digital Media

DePaul University

Chicago, USA

Email: [email protected]

Abstract—Computer-aided diagnosis (CAD) systems can be used as second readers in the imaging diagnostic process. Typically, to create a CAD system, the region of interest (ROI) has to be first detected and then delineated. This can be done either manually or automatically. Given that manually delineating ROIs is a time-consuming and costly process, we propose a CAD system based on multiple computer-derived weak segmentations (WSCAD) and show that its diagnostic performance is at least as good as that of predictions developed using manual radiologist segmentations. The proposed CAD system extracts a set of image features from the weak segmentations and uses them in an ensemble of classification algorithms to predict semantic ratings such as malignancy. These automated results are compared against a reference truth based on ratings and segmentations provided by radiologists to determine whether it is necessary to obtain manual radiologist segmentations in order to develop a CAD. By developing a pair of CADs using the Lung Image Database Consortium (LIDC) data, we show that WSCADs are at least as accurate in predicting semantic ratings as CADs based on radiologist segmentation.

Keywords—classification; computer-aided diagnosis; crowdsourcing; ensemble learning; lung cancer; segmentation

I. INTRODUCTION

Lung cancer is the most deadly type of cancer, and successful treatment depends on early detection [1]. The National Cancer Institute predicts 228,190 new cases of lung and bronchus cancer and 159,480 deaths for 2013 alone, surpassing both breast and prostate cancers' estimated deaths by more than 100,000 [2].

To combat this destructive disease, radiologists use CAD systems as a second opinion to assist with detecting and characterizing potentially cancerous nodules. In previous work, a CAD system has been employed as a second opinion for both board-certified radiologists and residents. Awai et al. concluded that the use of a CAD system as a second opinion significantly aided the ability of both board-certified radiologists and residents to detect pulmonary nodules in chest computed tomography (CT) scans [3].

The CAD systems considered in this paper assist with the characterization of lung nodules by predicting their semantic characteristics. These semantic characteristics include ratings for the categories of texture, spiculation, lobulation, sphericity, subtlety, margin, and likelihood of malignancy. To generate these predictions, previous research used region contours manually determined by radiologists as a basis to derive low-level image features [4]. From these image features, predictions of semantic characteristics were made.

To determine whether there is a need for radiologists to continue to draw manual segmentations for these predictions, we propose a CAD system based on multiple computer-derived weak segmentations (WSCAD) and show that its predictive accuracy is comparable with that of a CAD system based on the manual radiologists' segmentations.

Weak segmentation has several benefits over manual segmentation by radiologists. Manual segmentation is time consuming and costly. In addition, it has been shown to be a subjective process with high variability between radiologists [5]. Our proposed system would replace subjective opinions with a precisely defined algorithmic process. It also has considerable cost-savings potential, substituting costly radiologists' time with the trivial cost of digital computation.

Our inspiration for using WSCADs is crowdsourcing, a new and popular form of research that harnesses the power of multiple non-experts attempting to solve or contribute to a single problem. However, a traditional crowdsourcing-based approach raises several issues, such as privacy and reliability challenges. While the system could be built to keep patients' CT scans anonymous, releasing active patient records could be an invasion of privacy. Furthermore, in a traditional crowdsourcing model a CAD system would have to collect a set of image segmentations from a crowd. This is also problematic because there is no guarantee that a crowd will always be present to accomplish the segmentation, potentially rendering the CAD system useless.

In our machine-based method, all patient information and segmentation production are encapsulated in a closed system accessible only to medical professionals. Additionally, a machine-based method depends upon computers to generate the segmentations, guaranteeing that a set of weak segmentations can always be produced. Finally, when newer and improved segmentation algorithms are developed, they can be incorporated into the system for improved results.

The remainder of this paper is organized as follows. Section II discusses related work in weak segmentation, ensemble learning, and crowdsourcing. Section III is an overview of the methods used in creating our WSCAD and a manual radiologists' contour-based CAD to serve as the reference truth. Section IV discusses the results of the WSCAD and the reference truth and compares their accuracies. Section V summarizes the conclusions of the study, and Section VI addresses possible future work, including generating additional segmentations and investigating algorithms to handle skewed class distributions.

2013 12th International Conference on Machine Learning and Applications
978-0-7695-5144-9/13 $26.00 © 2013 IEEE
DOI 10.1109/ICMLA.2013.170

II. RELATED WORK

Several studies have been conducted on the weak segmentation of lung nodule and mammogram images. Zhang et al. developed an ensemble system in which multiple weak segmentations are used in an ensemble of classification algorithms to predict malignant or benign masses in mammograms [6]. Zinoveva et al. performed nodule segmentation and addressed the use of observations with uncertain truth by using probability maps computed from the multiple contours of suspected lung nodules supplied by a panel of radiologists. These maps were used to create a classifier that determined whether a particular pixel belonged to the nodule [7].

Other work on image analysis employed crowdsourcing to annotate images [8], classify malaria-infected red blood cells [9], and, in the case of Amazon's Mechanical Turk, collect segmentations and labels for sets of images [10]. Rashtchian et al. used Amazon's Mechanical Turk to gather image annotations (descriptive sentences of what is occurring in the image) for the dataset provided at the Visual Object Classes Challenge in 2008; they discuss the challenges faced when collecting data on such an open and noisy system [8]. Mavandadi et al. created a simple game to facilitate a crowdsourced classification of malaria-infected red blood cells [9]. To acquire a final classification result from the gamers, Mavandadi et al. modeled the inputs received from gamers as a communication system in which a gamer acts as a noisy binary channel [9]. This is a well-established practice but requires classification to be limited to two classes. These inputs are combined using maximum a posteriori estimation to generate a single classification label from a collection of gamers' transmitted classifications for a single red blood cell image [9]. Vijayanarasimhan and Grauman used Amazon's Mechanical Turk system to acquire multiple segmentations of the same image and assigned an easy or hard label based on the average time taken on an image [10].

Having multiple segmentations of the same ROI allows a larger range of ensemble learning techniques to be applied to the problem. One ensemble learning method that has been successful in improving CAD accuracy is to divide samples according to meta-features not directly related to the image features, such as patient age, and train a variety of classifiers on each subgroup; a winner-takes-all approach is then used to select the classifier with the highest accuracy for each subgroup [6]. Other methods of ensemble learning in current practice include bagging [11] and boosting [12], which enhance the performance of an individual classifier algorithm by training it repeatedly on a single population using a variety of samplings or weightings.

Some of the previous work on predicting semantic ratings of nodules from image features treats the radiologists' ratings as classes [4]. Work on the classification of ratings in other domains, such as the Netflix Prize, has shown significant potential for improving prediction accuracies by instead treating ratings as numerical values and applying a heterogeneous ensemble learning method known as stacked generalization [13]–[15]. Stacked generalization is a multi-level ensemble learning method that uses the predictions made by a collection of base-level classifiers as input features and generates predictions with higher accuracy than any of the individual classifiers. This method allows dissimilar types of machine learning algorithms to be used together but does not perform well on labels with a small number of possible values, such as non-probabilistic classes. The LIDC labels are ratings for categories such as the likelihood of malignancy and are given as one of five possible values ranging from 1 to 5. The potential increase in prediction accuracy from using stacked generalization is one of the motivations for treating these ratings as numerical values instead of classes in this study.

III. METHODOLOGY

A. Dataset

The Lung Image Database Consortium (LIDC) contains CT scans of patients' lungs along with outlines of suspected lung nodules and radiologists' evaluations of these nodules [16]. From the LIDC dataset we used a total of 2,636 nodules with a diameter greater than 3 millimeters. Each of these nodules appeared in multiple CT slices from different vertical planes; only the slice that contained the largest area of the nodule, as determined by a single radiologist's outline, was considered. For further information about the LIDC dataset, refer to [16].

Each image in the LIDC is accompanied by the nodule's segmentation contour and semantic ratings as determined by a panel of four radiologists. The segmentation contours are hard boundaries identifying the extent of the nodule within the image. The largest among these contours formed the foundation of the reference truth by determining the pixels within the CT scan to be used for image feature extraction. The largest contour of each nodule also determined the extents of the nodule's bounding box segmented by the WSCAD.

The semantic ratings used to test and train the two CADs are descriptions of the nodules across nine categories: subtlety, sphericity, margin, lobulation, spiculation, texture, calcification, internal structure, and malignancy. The categories of calcification and internal structure were not used in this study, since extreme uniformity among their values resulted in them providing little information to the system. The removal of these two categories brought the total number of semantic rating categories in this study to seven.

For each of the seven categories, the radiologists supplied integer ratings ranging from 1 to 5 indicating the likelihood or strength of the manifestation of that characteristic. Subtlety is the ease of detecting the nodule given its contrast with the background. Sphericity describes the roundness of the nodule's shape. Margin is how clearly the edges of the nodule are defined. The lobulation category describes the lack of lobe shapes in the nodule, and the spiculation value indicates a lack of spike-shaped protrusions. Texture describes the degree to which the interior of the nodule is solid. Malignancy is the likelihood that the appearance of the nodule indicates a malignant nodule. For the specific definitions of each category's ratings, refer to [16].

While the panel included four radiologists, not every nodule was supplied with four contours and ratings. The radiologists on the panel usually did not all agree on whether or not a nodule was present, and only radiologists who determined that a nodule was shown in the image supplied ratings and contours. Therefore the number of contours and semantic ratings per nodule varies from one to four.

B. Experiment

1) Segmentation and Feature Extraction: From each of the 2,636 CT scans, an ROI was acquired by creating a bounding box around the radiologist's outline. While our bounding boxes were formed from the radiologists' outlines, they could also be generated from a simple indication of center and extent. Otsu's method [17] was used on the ROI to generate four segmentations by using four thresholds. The four thresholds split the region into five zones; four segmentations were created by combining a varying number of zones together, starting with the nodule's inner-most zone and ending with the outer-most non-background zone. Each formed a logical matrix identifying the separation of the nodule from the background (Fig. 1 provides a visual display of both the radiologists' segmentations and the Otsu segmentations). Due to the global nature of the Otsu method, all regions disjoint from the largest region were removed, and any voids in the segmented nodule were filled using an algorithm based on morphological reconstruction [18]. From these weak segmentations and the radiologists' original nodule contours, descriptive image features were extracted (see Table I for the complete list of features).

Fig. 1. A) Original ROI with a suspect nodule. B) 5-level Otsu applied to the ROI in A). C) The radiologists' outlines of this nodule. D) The final Otsu masks produced by the segmentation. "N" refers to the range of zones being considered in each mask, with zone 1 as the background.
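The thresholding step described above can be sketched as follows. This is a minimal illustration using scikit-image and SciPy, not the authors' code; the zone ordering (innermost zone taken as the brightest) and the cleanup details are assumptions based on the description and the Fig. 1 caption.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.filters import threshold_multiotsu
from skimage.measure import label

def weak_segmentations(roi, n_zones=5):
    """Build four nested binary masks from a grayscale ROI.

    A 4-threshold Otsu split divides the ROI into 5 zones; masks are
    formed by accumulating zones from the innermost (brightest) outward,
    then keeping only the largest connected region and filling holes.
    """
    thresholds = threshold_multiotsu(roi, classes=n_zones)
    zones = np.digitize(roi, thresholds)            # zone 0 = background
    masks = []
    for low in range(n_zones - 1, 0, -1):           # innermost mask first
        mask = zones >= low
        labeled = label(mask)
        if labeled.max() > 0:                       # drop disjoint regions
            sizes = np.bincount(labeled.ravel())[1:]
            mask = labeled == (np.argmax(sizes) + 1)
        masks.append(binary_fill_holes(mask))
    return masks
```

The four masks are nested by construction, mirroring panels B) and D) of Fig. 1.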

For a detailed description of the shape, size, intensity, and Gabor image features, refer to [4]. The Haralick feature calculations in this work differ from the previous work in three respects. First, due to the number of segmentations and the slow running time, only a single pixel distance was used in the calculation of the co-occurrence matrices. Second, the implementation is based on a faster calculation of the 13 original Haralick texture features. Third, we do not include the Maximal Correlation Coefficient, due to computational instability [19].

2) Label and Feature Usage: A significant difficulty in generating a direct comparison between the weak segmentation based WSCAD and the reference truth CAD based on the radiologists' contours arose from the variation in the number of ratings and contours supplied per nodule by the radiologists. In the WSCAD, the number of segmentations per nodule, and therefore the number of features extracted per nodule, was constant throughout the dataset. For example, our experiment produced four weakly segmented regions and therefore calculated four perimeter lengths, one from each segmentation. In contrast, each nodule could receive one, two, three, or four contours and ratings from the panel of radiologists.

To eliminate the variation in the quantity of features generated for each nodule in the reference truth, a single radiologist-supplied contour was selected for each nodule to be used in image feature generation. This selection was made by choosing the contour with the largest area out of those supplied for that nodule.

To resolve the variation in the number of labels while still capturing all semantic ratings supplied by the panel, the ratings from all radiologists who supplied ratings for a nodule were averaged into a single value. These 2,636 averaged values were used to train and test both the reference truth and the weak segmentation based classifier.

3) Learning Process and Predictions: The overall processes of generating reference truth predictions and weak segmentation based predictions are shown in Fig. 2. First, to create a reference truth to compare against, a CAD system was developed based on the ROI manually supplied by the radiologists' contours. For each nodule, the contour with the largest area was selected, and image features were extracted from that region. Using the extracted features and the averaged semantic ratings given by the panel as labels, a set of three

TABLE I. FEATURES

Shape (a): Circularity, Roughness, Elongation, Compactness, Eccentricity, Solidity, Extent, Radial Distance SD

Size: Area, Convex Area, Perimeter, Convex Perimeter, Equivalent Diameter, Major Axis Length, Minor Axis Length

Intensity (a): MinIntensity, MedianIntensity, MaxIntensity, MeanIntensity, SDIntensity, bgMinIntensity, bgMedianIntensity, bgMaxIntensity, bgMeanIntensity, bgSDIntensity, Intensity Difference

Texture (a): 13 Haralick features (Angular Second Moment, Contrast, Correlation, Sum of Squares: Variance, Inverse Difference Moment, Sum Average, Sum Variance, Sum Entropy, Entropy, Difference Variance, Difference Entropy, Info. Measure of Correlation 1, Info. Measure of Correlation 2); 24 Gabor features (mean and SD of Gabor filters at four orientations and three scales)

(a) SD = standard deviation, bg = background
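Several of the shape and size features in Table I map directly onto off-the-shelf region properties. The sketch below uses scikit-image on a stand-in rectangular mask; it illustrates the kind of measurements involved and is not the authors' feature code.

```python
import numpy as np
from skimage.measure import label, regionprops

# Hypothetical binary nodule mask (a 16x12 rectangle stands in for a nodule).
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 10:22] = True

props = regionprops(label(mask.astype(np.uint8)))[0]
features = {
    "Area": float(props.area),
    "Perimeter": float(props.perimeter),
    "Eccentricity": float(props.eccentricity),
    "Solidity": float(props.solidity),        # area / convex hull area
    "Extent": float(props.extent),            # area / bounding-box area
    "EquivalentDiameter": float(np.sqrt(4 * props.area / np.pi)),
}
```

In practice the mask would come from a radiologist contour (reference truth) or from one of the four Otsu masks (WSCAD), yielding one such feature vector per segmentation.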


Fig. 2. Process map for the generation and comparison of classifications. Reference truth path: largest radiologist border → extracted features → classifiers (decision tree, neural network, bagged trees) → stacked generalizer → predictions rounded to nearest class. Weak segmentation path: weak segmentations 1..N → extracted feature sets 1..N → classifiers (decision tree, neural network, bagged trees) per segmentation → stacked generalizer → predictions rounded to nearest class. The radiologists' ratings are averaged and rounded to the nearest class to form the semantic label classes against which both prediction sets are compared.

classifiers were trained: a decision tree classifier, a neural network classifier, and a bagged tree classifier (itself a type of ensemble learning). The classifiers output predictions in the form of deterministic but continuous numerical values, such as 4.33, as opposed to 'a 90% probability of class 4'. Predictions were generated by each of the three classifiers for all 2,636 nodules through the use of 10-fold cross-validation.

The values predicted for each nodule by the three classifiers were then used as input for the stacked generalizer. The stacked generalizer used multiple linear regression (MLR) to combine the predictions from each of the three classifiers into a single numerical prediction for each nodule. To calculate the accuracy of this final prediction, both the predictions and the averaged rating labels were first converted to classes by rounding them to the nearest integer and then checked for agreement.
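A minimal scikit-learn sketch of this two-level scheme follows. The specific estimators and hyperparameters are assumptions (the paper names the classifier types but not an implementation): out-of-fold predictions from cross-validation feed a multiple-linear-regression combiner, and the blended output is rounded to the 1-5 rating scale.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

def stacked_rating_predictions(X, y_avg, cv=10):
    """Blend three base regressors with MLR and round to rating classes."""
    base = [DecisionTreeRegressor(random_state=0),
            MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                         random_state=0),
            BaggingRegressor(random_state=0)]
    # Out-of-fold predictions keep the meta-learner's inputs unbiased.
    meta_inputs = np.column_stack(
        [cross_val_predict(m, X, y_avg, cv=cv) for m in base])
    blended = cross_val_predict(LinearRegression(), meta_inputs, y_avg, cv=cv)
    return np.clip(np.rint(blended), 1, 5)
```

For the WSCAD variant, the same idea extends to 12 meta-inputs: three base regressors trained separately on each of the four segmentations' feature sets.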

The process used by the WSCAD system to generate predictions was very similar. While the reference truth used one contour per nodule and therefore generated one set of features, the weak segmentation model generated multiple weak segmentations for each nodule. Each segmentation was used to generate a separate set of features, and a separate set of three classifiers was trained on each of the segmentations' feature sets. The outputs from each set of classifiers were concatenated to produce 12 inputs for the stacked generalizer (three classifiers multiplied by four segmentations). Then, as with the reference truth, the success of the WSCAD classifier was evaluated by rounding the final predictions and averaged semantic ratings to the nearest integer and calculating the percent agreement.

IV. RESULTS AND DISCUSSION

Table II shows the resulting accuracy of the two CAD systems. The weak segmentation based WSCAD performed an average of 3.86 percentage points better than the reference truth over the seven semantic categories, with improvements ranging from 1.21 to 6.11 percentage points. A t-test confirmed, with a threshold of p = 0.01, that the weak segmentation based classifier performed as well as or better than the reference truth for each category. In other words, classifiers based on weak segmentation can perform as well as or better than classifiers based on radiologist-supplied hard segmentations.
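Such a comparison can be reproduced in spirit with SciPy. The accuracies below are hypothetical per-fold values for one category (the paper reports only per-category totals and does not state the exact form of the test, so the paired design here is an assumption).

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies for one semantic category.
wscad = np.array([0.52, 0.55, 0.54, 0.51, 0.53, 0.56, 0.52, 0.54, 0.53, 0.51])
ref = np.array([0.49, 0.50, 0.48, 0.47, 0.51, 0.52, 0.48, 0.50, 0.49, 0.47])

t_stat, p_value = ttest_rel(wscad, ref)        # paired t-test across folds
wscad_at_least_as_good = bool(t_stat > 0 and p_value < 0.01)
```

A positive t statistic with p below the 0.01 threshold supports the claim that the WSCAD is at least as accurate for that category.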

Our results show that it is possible to match the performance of semantic rating classifiers that depend on radiologist-supplied hard segmentations with classifiers based on weak segmentations. Potential exists for weak segmentation based classifiers to develop into a tool useful to radiologists during CT scan evaluation.


TABLE II. PERCENT OF SAMPLES CLASSIFIED CORRECTLY

                                   Subtlety  Sphericity  Margin  Lobulation  Spiculation  Texture  Malignancy  Average
Weak Segmentation Ensemble          53.15%    60.09%     54.32%   61.12%      65.71%       69.84%   61.42%      60.81%
Reference Truth                     49.39%    56.83%     48.22%   56.64%      61.65%       65.67%   60.20%      56.94%
Improvement Over Reference Truth     3.76%     3.26%      6.11%    4.48%       4.06%        4.17%    1.21%       3.86%

V. CONCLUSION

Many of the current crowdsourcing-based mass segmentation approaches are not practical to use in a CAD system for several reasons, such as the dependence on a crowd to generate the segmentations and the various privacy concerns related to releasing patients' medical information, even anonymously, to the public. When performing medical diagnoses, a crowd will not always be available to perform mass segmentations. A WSCAD system is not constrained by these problems. Using a closed system for diagnosis limits the generation of the highly noisy data typical of crowdsource-based solutions. Moreover, the machine-based system is extensible to new segmentation algorithms that would generate more and better segmentations.

Having radiologists segment the nodules is an obvious way to obtain quality segmentations, but the time required to produce such detailed information represents a significant cost. Our experiment demonstrates that a weak segmentation based ensemble classifier can predict semantic ratings as well as a system that requires nodules to be segmented by radiologists. Weak segmentation based classifiers therefore have the potential to become a low-cost CAD tool to aid in the diagnosis of lung nodules [2]. While the bounding boxes used by the WSCAD in this study were formed from the radiologists' outlines, they could be generated by simply indicating the center and extent of the box without providing any outlines.

VI. FUTURE WORK

The performance of both the weak segmentation based classifier and the reference truth classifier may be improved by considering the dataset's meta-features. Studies on mammography CAD reported improvements in accuracy when the dataset was divided according to meta-features such as the patient's age or the nodule's size. While patient data are not available for the LIDC dataset, it may be possible to uncover relationships between the amount of agreement among the radiologists and the image features calculated on the ROI. This would allow the set of nodules to be split into subsets and sent to learning ensembles specialized in predicting nodules with those characteristics.

While in this experiment stacked generalization was applied using continuous numerical predictions as input, it may be of interest to test the performance when using class and probability pairs as input. One consideration will be the fact that the semantic rating data are unbalanced when treated as classes. To overcome this issue, we are investigating random under-sampling boost (RUSBoost), which combines under-sampling and AdaBoost to achieve higher classification accuracy [20]. In addition to RUSBoost, another ensemble method we wish to explore is an improved Random Forest algorithm designed specifically for datasets that have high-dimensional data with multiple classes [21].
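As a rough sketch of the idea with scikit-learn (note: true RUSBoost re-undersamples inside every boosting round, whereas this simplified version undersamples once and then boosts; the toy data are invented):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def undersample_balanced(X, y, rng):
    """Randomly undersample every class down to the minority-class count."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n, replace=False)
                          for c in classes])
    return X[idx], y[idx]

# Skewed two-class toy data standing in for an unbalanced rating category.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[:40] += 2.0
y = np.concatenate([np.ones(40, dtype=int), np.zeros(260, dtype=int)])

X_bal, y_bal = undersample_balanced(X, y, rng)
clf = AdaBoostClassifier(random_state=0).fit(X_bal, y_bal)
```

For the full per-round algorithm, the imbalanced-learn package provides a ready-made RUSBoost implementation.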

The performance of the weak segmentation based classifier might be improved by increasing the number of segmentations. To generate additional segmentations, additional preprocessing steps could be employed, such as Gaussian filtering to remove image noise. Furthermore, we have already developed six additional segmentations based on a simple region-growing algorithm. Even more segmentations can be generated through region growing by adjusting the stopping criteria and varying the seed point. Region growing could also be developed into an algorithm to automatically calculate the extents of the ROI's bounding box from a single point within the nodule identified by radiologists. This would allow a CAD to be developed with even less effort from radiologists than if they supplied bounding boxes.

ACKNOWLEDGMENT

This work is partially supported by NSF grant #1062909.

REFERENCES

[1] D. R. Aberle, A. M. Adams, C. D. Berg, W. C. Black, J. D. Clapp, R. M. Fagerstrom, I. F. Gareen, C. Gatsonis, P. M. Marcus, J. Sicks et al., “Reduced lung-cancer mortality with low-dose computed tomographic screening,” The New England Journal of Medicine, vol. 365, no. 5, p. 395, 2011.

[2] L. Ries, D. Melbert, M. Krapcho, D. Stinchcomb, N. Howlader, M. Horner, A. Mariotto, B. Miller, E. Feuer, S. Altekruse et al., “SEER cancer statistics review, 1975-2005,” Bethesda, MD: National Cancer Institute, pp. 1975–2005, 2008.

[3] K. Awai, K. Murao, A. Ozawa, M. Komi, H. Hayakawa, S. Hori, and Y. Nishimura, “Pulmonary nodules at chest CT: effect of computer-aided diagnosis on radiologists’ detection performance,” Radiology, vol. 230, no. 2, p. 347, 2004.

[4] D. Zinovev, D. Raicu, J. Furst, and S. G. Armato III, “Predicting radiological panel opinions using a panel of machine learning classifiers,” Algorithms, vol. 2, no. 4, pp. 1473–1502, 2009.

[5] C. R. Meyer, T. D. Johnson, G. McLennan, D. R. Aberle, E. A. Kazerooni, H. MacMahon, B. F. Mullan, D. F. Yankelevitz, E. J. van Beek, S. G. Armato III et al., “Evaluation of lung MDCT nodule annotation across radiologists and methods,” Academic Radiology, vol. 13, no. 10, pp. 1254–1265, 2006.

[6] Y. Zhang, N. Tomuro, J. Furst, and D. S. Raicu, “Building an ensemble system for diagnosing masses in mammograms,” International Journal of Computer Assisted Radiology and Surgery, vol. 7, no. 2, pp. 323–329, 2012.

[7] O. Zinoveva, D. Zinovev, S. A. Siena, D. S. Raicu, J. Furst, and S. G. Armato, “A texture-based probabilistic approach for lung nodule segmentation,” in Image Analysis and Recognition. Springer, 2011, pp. 21–30.

[8] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using Amazon’s Mechanical Turk,” in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. Association for Computational Linguistics, 2010, pp. 139–147.

[9] S. Mavandadi, S. Dimitrov, S. Feng, F. Yu, U. Sikora, O. Yaglidere, S. Padmanabhan, K. Nielsen, and A. Ozcan, “Distributed medical image analysis and diagnosis through crowd-sourced games: a malaria case study,” PLoS One, vol. 7, no. 5, 2012, e37245.

[10] S. Vijayanarasimhan and K. Grauman, “Cost-sensitive active visual category learning,” International Journal of Computer Vision, vol. 91, no. 1, pp. 24–44, 2011.

[11] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[12] Y. Freund, R. E. Schapire et al., “Experiments with a new boosting algorithm,” in ICML, vol. 96, 1996, pp. 148–156.

[13] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.

[14] K. M. Ting and I. H. Witten, “Issues in stacked generalization,” Journal of Artificial Intelligence Research, vol. 10, pp. 271–289, 1999.

[15] J. Sill, G. Takacs, L. Mackey, and D. Lin, “Feature-weighted linear stacking,” arXiv preprint arXiv:0911.0460, 2009, unpublished.

[16] S. G. Armato, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman et al., “The lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans,” Medical Physics, vol. 38, pp. 915–931, 2011.

[17] N. Otsu, “A threshold selection method from gray-level histograms,” Automatica, vol. 11, no. 285-296, pp. 23–27, 1975.

[18] P. Soille, “Morphological image analysis: Principles and applications,” pp. 173–174, 1999.

[19] E. Miyamoto and T. Merryman, “Fast calculation of Haralick texture features,” Available from: users.ece.cmu.edu/~pueschel/teaching/18-799B-CMU-spring05/material/eizan-tad.pdf, 2005, unpublished.

[20] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Trans. Syst., Man, Cybern. A, vol. 40, no. 1, pp. 185–197, 2010.

[21] B. Xu, Y. Ye, and L. Nie, “An improved random forest classifier for image classification,” in Information and Automation (ICIA), 2012 International Conference on. IEEE, 2012, pp. 795–800.
