Automated Detection of Simulated Motion Blur

in Digital Mammograms

Nada Kamona, B.S. and Murray Loew, Ph.D.

Department of Biomedical Engineering, George Washington University, Washington, D.C.

Rationale

Motion blur is a known phenomenon in full-field digital mammography that arises during image acquisition. It

has been reported to reduce lesion detection performance and mask small microcalcifications, resulting in

failure to detect smaller abnormalities until they reach more advanced stages. It is estimated that 20% of

screening mammograms show elements of blur. Motion blur has been found to be due mainly to paddle motion

(up to 1.5 mm vertically) during the clamping phase of the mammography exam. We propose using machine

learning algorithms to automatically detect motion blur, which could support the clinical decision-making

process during the mammography exam by allowing for an immediate retake, thereby preventing unnecessary

expense, time, and patient anxiety.

Methods

To mimic blur seen in mammograms, we simulated it mathematically. The blur point-spread function mask is

generated by displacing an individual pixel by a random vector (within the range of the blur effect) and the

pixel contribution to the overall image is then sampled on a regular pixel grid using subpixel linear

interpolation. This randomly-generated motion trajectory is constrained by several factors; we examined the

effects of variations in tissue elasticity, imaging exposure time, and size of blur effect (motion boundary in

millimeters). The blur mask is convolved with a mammogram to create blur. Three motion blur magnitudes

(0.5, 1.0, and 1.5 mm) were simulated on 68 mammograms (INbreast Database, normal cases, CC and MLO

views). Blur was quantified using 17 blur operators for each mammogram and at each blur level (272 images

total). Machine learning classifiers, including Linear Support Vector Machine (SVM) and Subspace

Discriminant Ensemble (SDE), were trained to distinguish three levels of blurred from unblurred mammograms,

using four-way classification.
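The simulation steps described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the random-walk step size, the 0.07 mm pixel pitch used to convert millimeters to pixels, and the `gradient_energy` measure (a stand-in for the 17 unspecified blur operators) are all assumptions, and tissue elasticity is not modeled.

```python
import math
import random


def motion_blur_psf(size, extent_px, steps=200, step_sigma=0.3, seed=0):
    """Return a size x size motion-blur PSF (nested lists) that sums to 1.

    A random-walk trajectory is clamped to +/- extent_px around the kernel
    center, and each sample is spread over its four neighboring pixels by
    bilinear (subpixel linear) interpolation. `steps` loosely stands in for
    exposure time; `step_sigma` is a hypothetical step size in pixels.
    """
    center = size // 2
    if size % 2 == 0 or extent_px > center - 1:
        raise ValueError("size must be odd and at least 2 * extent_px + 3")
    rng = random.Random(seed)
    psf = [[0.0] * size for _ in range(size)]
    x = y = 0.0
    for _ in range(steps):
        x = max(-extent_px, min(extent_px, x + rng.gauss(0.0, step_sigma)))
        y = max(-extent_px, min(extent_px, y + rng.gauss(0.0, step_sigma)))
        fx, fy = center + x, center + y
        x0, y0 = math.floor(fx), math.floor(fy)
        wx, wy = fx - x0, fy - y0
        psf[y0][x0] += (1 - wx) * (1 - wy)
        psf[y0][x0 + 1] += wx * (1 - wy)
        psf[y0 + 1][x0] += (1 - wx) * wy
        psf[y0 + 1][x0 + 1] += wx * wy
    return [[v / steps for v in row] for row in psf]


def convolve2d(image, kernel):
    """Same-size 2D convolution with replicated edges."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(k):
                ii = min(max(i - (u - r), 0), h - 1)
                for v in range(k):
                    jj = min(max(j - (v - r), 0), w - 1)
                    acc += kernel[u][v] * image[ii][jj]
            out[i][j] = acc
    return out


def gradient_energy(image):
    """Sum of squared neighbor differences; lower values suggest more blur."""
    h, w = len(image), len(image[0])
    energy = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                energy += (image[i][j + 1] - image[i][j]) ** 2
            if i + 1 < h:
                energy += (image[i + 1][j] - image[i][j]) ** 2
    return energy


PIXEL_PITCH_MM = 0.07  # assumed detector pixel pitch
psf = motion_blur_psf(size=19, extent_px=0.5 / PIXEL_PITCH_MM)
step = [[0.0] * 12 + [1.0] * 12 for _ in range(12)]  # toy image with one edge
blurred = convolve2d(step, psf)
```

In practice the convolution would be done with an array library, and the resulting blur-operator values for each image would form the feature vector passed to the classifiers.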

Results

The average accuracy for classifying unblurred and blurred mammograms at three levels of magnitude was

75.40% and 74.60% for Linear SVM and SDE respectively. The true positive rate was highest for classifying

mammograms with no simulated blur, reaching 99% for both classifiers with a false negative rate of 1%. For

Linear SVM, the true-positive rates for blur levels 0.5 mm, 1.0 mm, and 1.5 mm are 75%, 57%, and 71%

respectively, while the false-negative rates are 25%, 43%, and 29% respectively. For SDE, the true-positive

rates are 72%, 51%, and 76% and the false-negative rates are 28%, 49%, and 24% for blur levels 0.5 mm, 1.0

mm, and 1.5 mm respectively. Classifiers trained only to distinguish mammograms with no blur from those with

the lowest simulated blur level (0.5 mm) achieved accuracies of 98.5% and 97.8% for the Linear SVM and SDE

respectively.
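As a quick consistency check on the four-way results, the sketch below derives the false-negative rates and the class-averaged accuracy from the reported true-positive rates. It assumes the four classes are equally sized (68 images each), in which case overall accuracy equals the mean per-class true-positive rate; the variable names are illustrative only.

```python
# Reported per-class true-positive rates: no blur, 0.5 mm, 1.0 mm, 1.5 mm.
reported_tpr = {
    "Linear SVM": [0.99, 0.75, 0.57, 0.71],
    "SDE": [0.99, 0.72, 0.51, 0.76],
}


def summarize(rates):
    """Return (false-negative rates, mean true-positive rate)."""
    fnr = [round(1.0 - r, 2) for r in rates]
    mean_tpr = sum(rates) / len(rates)
    return fnr, mean_tpr


summary = {name: summarize(rates) for name, rates in reported_tpr.items()}
```

The means come out at about 75.5% and 74.5%, within rounding of the reported 75.40% and 74.60%, since the per-class rates are themselves rounded to whole percentages.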

Conclusion

Our preliminary results show the potential to detect simulated blur automatically using machine learning

classifiers and blur operators. Although limited work has been done to quantify the effects of motion blur on

radiologists’ performance, there is evidence that although motion blur might not be detected visually by a

human observer, it can nevertheless affect diagnostic performance. We are now using larger mammographic

datasets to train convolutional neural networks and validate the developed blur model.

Assessment of BREAST as a learning tool for

breast cancer detection for trainees using digital

mammography

A Ganesan MSc1, PC Brennan PhD1,2, K Tapia MSc2, C Mello-Thoms PhD 1,3

1Medical Image Optimization and Perception research Group (MIOPeG), Faculty of Health Sciences, University

of Sydney, NSW, Australia. 2BreastScreen Reader Assessment Strategy (BREAST), University of Sydney, NSW, Australia.

3University of Iowa, Department of Radiology, IA, USA.

Rationale

Mammography is the primary screening tool for early detection of breast cancer. However, about 30% of cancers

are missed. Previous research suggested that the level of readers’ experience is one of the most important factors

affecting their accuracy in detecting lesions. The BreastScreen Reader Assessment Strategy (BREAST) is an online

testing platform that assesses how well clinicians, including radiologists, trainees and breast physicians,

detect breast cancer on digital mammograms. A recent study showed that engaging with test sets

significantly improved radiologists’ breast cancer detection performance. In this study, we aim to assess the

impact of BREAST as a training tool for improving trainees’ breast cancer detection performance.

Methods

This study was conducted using BREAST, an online screen reading test which allows readers to read test sets, report

the cancerous cases, mark their location and rate them on a scoring scale of 1-5 (1-“normal”, 2-“benign”,

3-“equivocal”, 4-“suspicious”, and 5-“malignant”). Five test-sets (Hobart, Sydney, Darwin, Melbourne and

Gold Coast) and twenty-three trainees, who completed at least three of the test-sets in chronological order of release,

were included in this study. To demonstrate the level of improvement, the test sets were grouped in three (G1, G2

and G3) based on the order of release and readers who completed the three test-sets from each group were arranged

in chronological order of completion and named as TS1, TS2 and TS3 respectively. Performance measures including

sensitivity, specificity, location sensitivity, area under the receiver operating characteristic curve (ROC AUC) and

jackknife alternative free-response receiver operating characteristic (JAFROC) figure-of-merit of every test-set

were compared between each pair of test sets within each group.
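To make the first three performance measures concrete, here is a minimal sketch of how they could be computed from 1-5 ratings. The decision threshold (a rating of 3 or higher counts as a positive call) and the toy data are assumptions for illustration, not the scoring rules used by BREAST.

```python
def sensitivity_specificity(ratings, has_cancer, threshold=3):
    """Return (sensitivity, specificity) for ratings on the 1-5 scale."""
    tp = fn = tn = fp = 0
    for rating, cancer in zip(ratings, has_cancer):
        positive = rating >= threshold
        if cancer:
            tp += positive
            fn += not positive
        else:
            fp += positive
            tn += not positive
    return tp / (tp + fn), tn / (tn + fp)


def location_sensitivity(ratings, has_cancer, location_correct, threshold=3):
    """Fraction of cancer cases called positive AND marked in the right place."""
    cancers = hits = 0
    for rating, cancer, located in zip(ratings, has_cancer, location_correct):
        if cancer:
            cancers += 1
            hits += rating >= threshold and located
    return hits / cancers


# Hypothetical reader: three cancer cases followed by five normal cases.
ratings = [5, 4, 2, 1, 3, 1, 2, 4]
has_cancer = [True, True, True, False, False, False, False, False]
location_correct = [True, False, False, False, False, False, False, False]

sens, spec = sensitivity_specificity(ratings, has_cancer)
loc_sens = location_sensitivity(ratings, has_cancer, location_correct)
```

For this toy reader, sensitivity is 2/3 and specificity is 3/5, while location sensitivity drops to 1/3 because only one detected cancer was marked in the correct location.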

Results

The results showed significant improvement in specificity between some of the test sets for one group of trainees.

No other significant improvement in trainees’ performance was shown.

Conclusion

It is interesting to note that whilst BREAST has been a very effective tool for improving radiologists’ performance,

the improvements are not so evident with registrars. The most likely explanation is that the cases used for BREAST

are highly challenging ones which may be too difficult for educational purposes for more junior doctors. The need

to tailor test sets specific to the level of training and experience is emphasized.

Human and model observer study for task

detection in digital breast tomosynthesis

Seungyeon Choi1), Sunghoon Choi2), Donghoon Lee1), Young-Wook Choi3), and Hee-Joung Kim1),2)*

1) Department of Radiation Convergence Engineering, Yonsei University, Wonju, Korea

2) Department of Radiological Science, Yonsei University, Wonju, Korea

3) Pioneering Medical-Physics Research Center, Korea Electrotechnology Research Institute

(KERI), Ansan 15588, Republic of Korea

Rationale

Task-based assessment of image quality using theoretical observer models has recently attracted

attention in medical imaging. Observer models that closely match human

observer performance under various imaging conditions are considered key to virtual

clinical studies. The current work focuses on experimental studies using a prototype

digital breast tomosynthesis (DBT) system to compare the task-based detectability

index with human observer performance across tomosynthesis imaging protocols with different

angular ranges.

Methods

We used the prototype DBT system developed by the Korea Electrotechnology Research Institute with

angular range setups from ±10.5° to ±24.5°, each using the same number of projection images (15).

Human observer performance was measured in four-alternative forced-choice (4AFC) tests for detection

of different tasks, including spheroidal masses and microcalcification clusters. The task-based

detectability index (d’) of the non-prewhitening matched filter observer was calculated by analyzing the

task function, local spatial resolution and local noise for spheroidal masses. The proportion of correctly

detected signals (𝑃𝑐𝑜𝑟𝑟) in the 4AFC tests was then compared with d’.

Results

In the human observer study, the average 𝑃𝑐𝑜𝑟𝑟 across seven observers was 0.87, with values ranging

from 0.71 to 0.92. 𝑃𝑐𝑜𝑟𝑟 decreased as the angular range increased from ±10.5° to ±24.5° for tasks of

different sizes. Moreover, the theoretical model

observer showed a trend similar to the human observers’ 𝑃𝑐𝑜𝑟𝑟 results.

Conclusions

In this study, we evaluated task-based human and model observer performance by

comparing the detectability index and 𝑃𝑐𝑜𝑟𝑟 among several tomosynthesis angular range setups. The

model observer showed a trend similar to the human observer results in our

prototype DBT system. Quantifying the correlation between theoretical and measured performance is needed

to better characterize task-based model observer performance in future studies.

Sneak Peek: Are Radiologist Search Patterns

Altered by a 2D Preview Before a Breast

Tomosynthesis Image?

Nicholas M. D’Ardenne, MBBS1; Robert M. Nishikawa, PhD1; Margarita L. Zuley, MD 1,2,

Chia-Chien Wu, PhD3; Jeremy M. Wolfe, PhD3.

1. Department of Radiology, University of Pittsburgh, Pittsburgh, PA.

2. University of Pittsburgh Medical Center, Magee Womens Hospital, Pittsburgh, PA.

3. Visual Attention Lab, Harvard University, Cambridge, MA.

Rationale

Digital Breast Tomosynthesis (DBT) is beginning to be used more frequently alongside Full Field

Digital Mammography (FFDM) in routine breast screening. One drawback of this newer

technology is the longer reading times. We aim to investigate if search patterns, duration of reading

and accuracy of diagnosis differ if radiologists are given a 2D preview before viewing 3D

tomosynthesis cases.

Methods

Readers were instructed to search for lesions as they would under normal clinical conditions and

were informed that this would be an enriched study (10 positive cases out of 20). Eye movements

were recorded with an SMI RED250mobile eye tracker sampling at 250 Hz. Calibration aimed for tracking error

below 0.5 deg. The images were read on an EIZO RadiForce GS520 5MP (2048 x 2560 native

resolution) monitor. There were three viewing conditions: 1) FFDM images alone, 2) DBT alone,

3) DBT with an FFDM preview. A single view was presented for each case. Cases were read over 3

sessions with a washout period of at least one week. Accuracy of diagnosis, time spent on the study

and search patterns were recorded for each case.

Results

Preliminary results from 3 (out of 12) readers (Table 1) have been reviewed. Two were experienced

readers (with 20 and 30 years of experience); the third had 3 years of experience. These preliminary

results indicate that the time spent viewing DBT decreases when a 2D preview is

provided, from a mean of 63.7 seconds (range 9.6-217.3) without the preview to 47.0 seconds (range

8.1-134.4) with the preview. The mean sensitivity and specificity of the observers’ findings are

essentially unchanged despite this decrease in time taken. In the eye tracking data, a somewhat

smaller percentage of breast area is covered when a reader has a preview (32%) compared to when

they do not have a preview (37%), assuming a 5 degree window around each fixation.
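The coverage measure can be illustrated with a small sketch: count the breast pixels that fall inside a circular window around any fixation. The window radius in pixels, the binary breast mask, and the function names are assumptions; converting the 5 degree window to pixels depends on viewing distance and monitor geometry, which are not given here.

```python
def coverage_percent(mask, fixations, radius_px):
    """Percent of breast pixels (mask == True) within radius_px of a fixation.

    mask is a list of rows of booleans; fixations are (x, y) pixel positions.
    """
    r2 = radius_px ** 2
    covered = total = 0
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if not inside:
                continue
            total += 1
            if any((x - fx) ** 2 + (y - fy) ** 2 <= r2 for fx, fy in fixations):
                covered += 1
    return 100.0 * covered / total if total else 0.0


def relative_change_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


toy_mask = [[True] * 10 for _ in range(10)]
single_pixel = coverage_percent(toy_mask, [(0, 0)], radius_px=0)
area_change = relative_change_percent(37.0, 32.0)
```

With the reported mean coverage values, the relative change from 37% to 32% is about -13.5%, which rounds to the 14% relative decrease quoted for the breast area reviewed.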

Table 1. Viewing time, sensitivity, specificity, and breast area viewed for each reader, without and with a 2D preview.

Without 2D Preview
Subj.   Mean View Time with Range (sec)   Sensitivity   Specificity   Area of Breast Viewed (%)
1       24.9 (9.6-44.9)                   0.50          0.70          28
2       58.2 (17.8-127)                   0.80          0.90          34
3       104 (45-217)                      0.90          0.50          50
Mean    63.7 (9.6-217)                    0.73          0.70          37

With 2D Preview
Subj.   Mean View Time with Range (sec)   Sensitivity   Specificity   Area of Breast Viewed (%)
1       19.5 (9.7-47.1)                   0.50          0.70          25
2       58.9 (30.2-133)                   0.80          0.80          35
3       56.4 (8.1-134)                    0.80          0.70          36
Mean    47.0 (8.1-134)                    0.70          0.73          32

Change Between Viewing Conditions (Δ)
Subj.   Mean View Time (sec)              Sensitivity   Specificity   Area of Breast Viewed (%)
1       -5.4                              0             0             -11
2       0.7                               0             -0.1          3
3       -47.3                             -0.1          0.2           -28
Mean    -16.7                             -0.03         0.03          -14

Conclusions

Our preliminary results suggest there is a decrease in the time taken to view DBT cases when a 2D

preview is supplied. As there is a relative decrease of 14% in the breast area reviewed by readers with

a 2D preview, the preview may allow readers to focus their search on a smaller fraction of the image without

sacrificing accuracy. We will present results from all 12 readers at the meeting.

Identifying Sources for Improving Breast

Image Quality within the setting of the

MQSA EQUIP

Lonie R Salkowski MD MS PhD1,2, Jess Harried RT3

University of Wisconsin School of Medicine & Public Health, Department of Radiology1

University of Wisconsin School of Medicine & Public Health, Department of Medical Physics2

University of Wisconsin Health Sciences3

Rationale

In January 2017, the Enhancing Quality Using the Inspection Program (EQUIP) was added to the

FDA/MQSA breast imaging program to ensure image quality review and implementation of

corrective processes. Breast image quality is the responsibility of both the technologists and

radiologists. Improper image quality can result in potentially missed breast cancers. Prior

research has suggested that positioning is a major reason for technical recalls. Breast imaging

fellowship trained radiologists spend a full year learning all elements of breast imaging including

assessment of image quality. General radiologists receive training about image quality in their

three months of required breast imaging during residency. Based on training practices it is

reasonable to expect differences in the type and number of technical recalls from fellowship

trained and general radiologists who practice breast imaging.

Methods

This HIPAA-compliant study was exempt from IRB review. In consecutive screening

mammograms (January 2015 through December 2018), prospectively recorded technical recalls

were collected from a hybrid breast imaging service. The technical recalls were compared for

imaging modality (FFDM or DBT), images requested, and indication(s) for technical recall

(motion, positioning, technical/artifact). Chi-squared tests evaluated statistical significance

between proportions.

Results

During the study interval, 58,448 screening mammograms were performed with 141 technical

recalls requested by the radiologists (0.24%). During the 1013 clinical days, 32.3% had coverage

with a breast fellowship trained radiologist. The general radiologists made 33 technical recalls,

and fellowship trained radiologists made 108 recalls. Comparing the images requested for

technical recall, general radiologists (28.3%) requested significantly more Left CC views than

fellowship trained (11.8%) (p=0.0059). Requests for Right CC, Right MLO

and Left MLO views did not differ significantly, although there was a trend for fellowship trained

radiologists to recall more Right MLO and Left MLO views.

The general radiologists had 38 reasons for recalling 33 cases, compared to 150 reasons for 108

recalls for the fellowship trained radiologists. There were significant differences in three groups

of reasons (motion, positioning, technical/artifact) for technical recall between fellowship trained

and general radiologists. General radiologists (36.8%) requested significantly more technical

recalls for motion compared to fellowship trained radiologists (14.0%)(p=0.0013). Fellowship

trained radiologists (68.0%) requested significantly more recalls for errors in positioning

compared to general radiologists (39.5%) (p=0.0012). There was no significant difference

(p=0.4279) in fellowship trained and general radiologists for artifact based technical recalls (18%

and 23.7% respectively).
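The comparison of proportions can be reproduced approximately with a 2x2 chi-squared test. In the sketch below the counts are back-calculated from the reported percentages (14 of 38 reasons for general radiologists and 21 of 150 for fellowship-trained radiologists were motion), and no continuity correction is applied; both choices are assumptions, since the abstract does not state the exact counts or test variant.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the upper tail
    probability is erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Rows: general, fellowship trained. Columns: motion, other reasons.
motion_stat, motion_p = chi_square_2x2(14, 24, 21, 129)
```

With these assumed counts the statistic is about 10.4 and the p-value is close to 0.001, in line with the reported p=0.0013 for motion-related recalls.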

Conclusions

The EQUIP program requires a mechanism for image quality improvement and

feedback. Fellowship trained breast imagers have more concentrated and longer training in

image quality than general radiologists. Additional training for both general and fellowship

trained radiologists in identifying image quality, with attention to positioning errors, will

enhance a breast imaging program and provide improved patient care.

Relationship between Obuchowski-

Rockette and Gallas U-statistic methods

for analyzing multi-reader diagnostic

imaging data

Stephen L. Hillis, PhD

Departments of Radiology & Biostatistics, University of Iowa

Rationale

The Obuchowski-Rockette (OR) and Gallas U-statistic (U-stat) methods have been the two most

frequently used methods for analyzing multireader multicase (MRMC) diagnostic imaging data

that allow conclusions to generalize to both the reader and case populations. The OR method is

the more general method because it can be used with any reader-performance measure, whereas

the U-stat method is limited to a U-statistic outcome, such as the empirical (or trapezoidal) AUC

statistic. On the other hand, advantages of the U-stat method are that it provides exact

expressions for the outcome variance, provides unbiased variance estimates, and makes it easy

to size future studies having a different abnormal-to-normal case ratio than was used in a pilot

study. However, previously it has not been clear if there is a direct link between the two

methods. In this talk I discuss a particular version of the OR method that produces the same test

statistic as the U-stats method.
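For readers unfamiliar with why the empirical AUC counts as a U-statistic, the sketch below computes it as the average of a two-sample kernel over all normal-abnormal case pairs (1 if the abnormal case is rated higher, 0.5 for a tie, 0 otherwise). The ratings are made up for illustration, and the code covers a single reader and modality only, not the MRMC variance estimation discussed in the talk.

```python
def empirical_auc(normal_scores, abnormal_scores):
    """Empirical (trapezoidal) AUC as a two-sample U-statistic."""
    total = 0.0
    for x in normal_scores:
        for y in abnormal_scores:
            if y > x:
                total += 1.0
            elif y == x:
                total += 0.5
    return total / (len(normal_scores) * len(abnormal_scores))


# Hypothetical confidence ratings for four normal and four abnormal cases.
auc = empirical_auc([1, 2, 2, 3], [2, 3, 4, 5])
```

Here 13.5 of the 16 pairs are ordered correctly (counting ties as one half), giving an AUC of 0.84375.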

Methods

I discuss a new way to estimate the error covariances when using the OR model which utilizes

the U-statistic approach.

Results

I show analytically that this version of the OR method produces the same test statistic as the U-

stats method.

Conclusions

Showing that a U-stats analysis can be performed using the OR method is useful in several ways:

(1) Previously the U-stats method was limited to comparison of two modalities. Now

the U-stats method can be used for testing for equivalence for several modalities, because the OR

method allows for this. (2) The equivalence of the statistics establishes that there is now an

unbiased variance version of the OR method available for U-statistic outcomes. (3) For U-

statistic outcomes, it is now easy to use the OR method to compute sample size for studies

having a different abnormal-to-normal case ratio than was used in a pilot study. (4) Negative

variances using the U-stat method can be avoided by using the well-tested OR approach for

computing degrees of freedom and constraining the variance to be positive. (5) If researchers

want to analyze a U-statistic outcome, they no longer have to be concerned with the question of

which method is better.

The strength of the gist of the abnormal in the

unilateral and bilateral mammograms

Ziba Gandomkar* [a], Ernest U. Ekpo [a], Sarah J. Lewis [a], Karla K. Evans [b], Kriscia Tapia [a], Tong Li [a], Seyedamir

Tavakoli Taba [a], Jeremy M. Wolfe [c], Patrick C. Brennan [a]

[a] Medical Imaging Sciences, Faculty of Health Sciences, University of Sydney, Sydney, NSW, Australia;

BreastScreen Reader Assessment Strategy (BREAST), University of Sydney, Sydney, NSW, Australia.

[b] Department of Psychology, University of York, Heslington, York, UK.

[c] Visual Attention Lab, Harvard Medical School, Cambridge, MA, USA.

Rationale

Experts can perceive the gist of the abnormal in the negative prior unilateral mammograms of women who were subsequently

diagnosed with breast cancer. Here, we compared the strength of the gist from unilateral and bilateral mammograms.

Methods

Seventeen radiologists viewed 60 cases in two different experiments (GistUnilateral and GistBilateral). In GistUnilateral, 60

unilateral craniocaudal mammograms were presented in a randomly generated sequence for a half-second to the

radiologists, who were asked to provide an abnormality probability for each case on a scale from 0 (confident normal)

to 100 (confident abnormal). In GistBilateral, we presented bilateral mammograms of the same cases using a similar

experimental protocol. Readers were randomly assigned to two groups: the first group did the unilateral experiment first, while

the second group did the bilateral experiment first. Four categories of mammograms (15 cases per category) were

included: 1) Cancer cases, which contained biopsy-proven malignancies; 2) Normal cases, which remained normal for

at least the next two years; 3) Prior_Vis cases, which contained retrospectively visible non-actionable cancer signs; 4)

Prior_Invis cases, which did not contain visible cancer signs. Mammograms from the last two groups were from women

who subsequently developed biopsy-proven malignancies. For each radiologist and each category, the Pearson

correlation between the unilateral and bilateral gist responses was calculated. In each experiment, three pair-wise

classifications, i.e. Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal were analysed. A paired, two-sided

Wilcoxon Signed Rank test was used to investigate whether the values of area under receiver operating characteristic

curves (AUC) were at an above-chance (AUC=0.5) level. The same test was also used to show whether the AUC values

from two experiments differed significantly for each pair-wise classification. For each radiologist and each case, we also

calculated the average of the two gist responses recorded in the two experiments and produced GistAVE, i.e.

½(GistUnilateral+GistBilateral).
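The two per-reader quantities described above, the Pearson correlation between unilateral and bilateral responses and the averaged signal GistAVE, can be sketched as follows. The response values are invented for illustration and the function names are not from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def gist_average(unilateral, bilateral):
    """GistAVE = (GistUnilateral + GistBilateral) / 2 for each case."""
    return [(u + b) / 2 for u, b in zip(unilateral, bilateral)]


# Hypothetical 0-100 abnormality ratings from one reader for four cases.
gist_unilateral = [10, 40, 70, 90]
gist_bilateral = [20, 30, 80, 60]

r = pearson(gist_unilateral, gist_bilateral)
gist_ave = gist_average(gist_unilateral, gist_bilateral)
```

The averaged ratings in `gist_ave` would then be scored against case truth with an AUC in the same way as the ratings from each individual experiment.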

Results

The average correlation coefficients across the 17 readers for Cancer, Normal, Prior_Vis, Prior_Invis, and all cases were

0.17 (CI=0.03-0.31), 0.26 (CI=0.09-0.43), 0.30 (CI=0.12-0.49), 0.35 (CI=0.21-0.49), and 0.35 (CI=0.25-0.44),

respectively. The order of median AUCs in Cancer/Normal and Prior_Vis/Normal classifications from the highest to the

lowest was GistAVE>GistUnilateral>GistBilateral. All differences except the difference for GistAVE and GistUnilateral in

Prior_Vis/Normal classification were significant. In Prior_Invis/Normal classification, the order was

GistAVE>GistBilateral>GistUnilateral. None of the differences in the AUC values for Prior_Invis/Normal classification were

significant. On average, the AUCs of Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal classifications based on

GistUnilateral respectively dropped by 8%±6%, 10%±8%, and 1%±8% in the bilateral experiment while these AUCs

increased by 5%±3%, 2%±4%, and 4%±6% after averaging two signals. On average, the AUCs of Cancer/Normal,

Prior_Vis/Normal, Prior_Invis/Normal classifications based on GistAVE were 82%±4%, 74%±3%, and 67%±5%.

Conclusions

There is a weak association between the gist signals from unilateral and bilateral mammograms. The signal was stronger in

the unilateral experiment. When the two signals were averaged, the AUCs increased. The improvement could result

from random noise cancelling out when two values are averaged. Further work is required to investigate intra-reader variability and to explore

the AUC when a reader’s unilateral gist responses are averaged across multiple experiments.

Perceptual Training –

Learning versus Attentional Shift

Soham Banerjee, MD [1]; Megan Mills, MD [1]; Trafton Drew, PhD [2];

William F. Auffermann, MD/PhD [1*]

[1] Department of Radiology and Imaging Sciences, University of Utah Health, Salt Lake City,

UT, USA; [2] Department of Psychology, University of Utah, Salt Lake City, UT, USA;

[*] Corresponding Author

Rationale: Perceptual training (PT) has been shown to improve healthcare trainees’ ability to identify

abnormalities on chest radiography (CXR). Specifically, recent studies have examined the

effects of search pattern training, and showed improved performance with training. However, it

was not clear if the improved performance was due to learning, or due to an attentional shift

resulting from cueing related to the training. The objective of this study is to determine

whether improved subject performance on CXR evaluation after PT is due to learning or

attentional shift.

Methods: A perceptual training experiment with 41 physician assistant trainees was performed. All

subjects voluntarily participated and provided informed consent. Subjects evaluated CXRs for

appropriate central venous catheter (CVC) positioning and other imaging related tasks before and

after educational interventions. For the intervention, the control group received an attentional

control task, and the experimental group received perceptual training in the form of search

pattern training for CVC characterization.

Many of the subjects' tasks were similar to prior studies and included: 1) Marking the tip of the

catheter, 2) Indicating their confidence in catheter tip localization, 3) Indicating whether the

catheter was adequately positioned or malpositioned.

In addition, subjects were asked to rate whether the cardiac silhouette was normal or abnormally

enlarged using a 5-point scale. Information on how to perform cardiac evaluation was given

only at the beginning of the study with the study’s introductory materials.

Subject ability to characterize the adequacy of catheter positioning (Line-Safe) and the heart size

(Heart-Size) were quantified using receiver operating characteristic (ROC) analysis. Subject

ability to localize the catheter tip (Line-Loc) was quantified using localization ROC (LROC)

analysis. The figure of merit for performance was the area under the curve (AUC).

Results: The difference in AUC for subject performance before and after the educational intervention and

the corresponding p-values are given in the table below.

               Line-Loc            Line-Safe           Heart-Size
               ΔAUC    P-Value     ΔAUC    P-Value     ΔAUC    P-Value
Control        -0.11   0.88        0.06    0.01        0.04    0.01
Experimental   0.30

RadSimP - A Custom Software Solution

for Perceptual Training Compared with

Current Perceptual Software

Soham Banerjee, MD [1]; Megan Mills, MD [1]; Trafton Drew, PhD [2];

William F. Auffermann, MD/PhD [1*]

[1] Department of Radiology and Imaging Sciences, University of Utah Health, Salt Lake City,

UT, USA; [2] Department of Psychology, University of Utah, Salt Lake City, UT, USA;

[*] Corresponding Author

Rationale: Recent studies have shown the utility of perceptual training (PT) for teaching healthcare

trainees good perceptual habits when evaluating medical images. Prior studies were performed

using software designed for perceptual observer studies. As most software packages for image

perception are geared towards research, they were not optimized for perceptual training and

assessment. To date, there have been no software packages specifically designed for perceptual

training. The goal of this study is to determine if perceptual training using our custom software

solution, RadSimP, results in improved performance relative to training using current

perceptual research software.

Methods: PT for central venous catheter (CVC) positioning was performed using a counterbalanced

design. Subjects were shown several sets of chest radiographs (CXRs) with CVCs that were

either adequately positioned or malpositioned. Subjects were asked to: mark the tip of the catheters, rate

their confidence in catheter tip localization, and state whether or not the catheters were

adequately positioned.

The same study was conducted twice using two different PT software packages. Study-A used

ViewDEX (https://sas.vgregion.se/en/for-dig-som-ar/vardgivare/viewdex/), a software package

for perceptual research. Study-B used RadSimP, our custom perceptual training and radiology

workstation simulator software package, written in Python.

All subjects voluntarily participated and provided informed consent. For Study-A, 14 physician

assistant students participated. For Study-B, 41 physician assistant students participated.

Training and assessment were done at individual computer workstations in an educational

computer classroom.

During Study-A, the trainees had to manually switch between folders on the desktop to access

the appropriate educational materials and had to manually enter information to get the correct

set of cases for assessment. For Study-B, the RadSimP program seamlessly integrated subject

consent, training, practice, and assessment in a simulated radiology workstation environment.

In addition, RadSimP was loaded onto the classroom’s network storage drive, such that all

subjects could run it simultaneously, and the results were automatically collected in a central

location.

A survey was given to subjects after the completion of the study to assess the subjects’

impressions of perceptual training and the RadSimP software package. The survey asked if

subjects felt the search pattern training and simulator environment were helpful for learning

about radiology. Responses were collected using a 5-point Likert response format (where 5

indicates strongly agree).

Results: Using both training paradigms, the subjects in the experimental group showed a statistically

significant improvement in their ability to characterize a catheter as acceptable versus

malpositioned. The differences in areas under the localization receiver operating characteristic

curves were 0.07 using the conventional software and 0.1 using RadSimP, p-values of 0.02 and

Meaningful Feedback in Breast Imaging

Simulation Assessment

Lonie R Salkowski, MD MS PhD1,2, Mai A Elezaby MD1, Elizabeth A Krupinski, PhD3

University of Wisconsin School of Medicine, Department of Radiology1

University of Wisconsin School of Medicine, Department of Medical Physics2

Emory University, Department of Radiology & Imaging Sciences3

Rationale

There is a lack of objective assessment of the quality of interpretive skills in radiology residency

training. This is especially pertinent in breast imaging where there is no independent

interpretation of exams during residency and the majority of residents will not pursue a breast

imaging fellowship. Simulation is a validated technique which facilitates independent

interpretation of thoughtfully developed clinical cases with sequential exposure during residency

training. The format of meaningful feedback should provide both formative and summative

information that is beneficial and not punitive for residents.

Methods

We developed a breast imaging simulation to provide serial feedback for residents over the four

years of their radiology residency training. Users will be provided feedback in several formats.

First, users will receive a modified medical audit (recall rate, cancer detection rate, sensitivity, specificity, PPV1,

PPV2, and PPV3) that introduces the residents to the typical MQSA-mandated annual feedback

for radiologists who interpret mammograms. This audit will have higher educational impact when

the data are presented in the context of the resident’s own work and prepare them to reach

national performance benchmarks. Second, the users will be provided feedback on their

assessment of lesion types (masses, calcifications, asymmetries, architectural distortion). This

will objectively highlight areas that may need additional review and emphasis in the educational

program for users and clinical educators. Third, since breast density is a recently introduced

national concern and has implications for clinical practice (from the notification of patients about

their tissue density to the offering of additional clinical tests for patients with high tissue

density), it will be important to assess the user’s understanding of breast density. Lastly, to

keep the users motivated and provide an element of competitive playfulness in a residency

program, a gamification component has been added to the assessment. This will lead to more

engagement by residents with the intent that they will be better prepared for independent

interpretation at the completion of the residency. This gamification component will provide

overall scores for assessments, displayed on a leaderboard under self-assigned user names

for comparison to peers.
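
The audit measures named above can be sketched in a few lines. This is a minimal illustration that assumes the conventional screening definitions (a recalled case with confirmed cancer counts as a true positive); the class name, field names, and counts are hypothetical and are not taken from the simulation itself. PPV2 and PPV3 are omitted because they require biopsy recommendation and biopsy outcome data.

```python
from dataclasses import dataclass


@dataclass
class AuditCounts:
    """Hypothetical screening outcome counts for one reader."""
    true_pos: int   # recalled, cancer confirmed
    false_pos: int  # recalled, no cancer found
    true_neg: int   # not recalled, no cancer
    false_neg: int  # not recalled, cancer later found


def audit_metrics(c: AuditCounts) -> dict:
    """Compute basic audit metrics; assumes every denominator is non-zero."""
    total = c.true_pos + c.false_pos + c.true_neg + c.false_neg
    recalled = c.true_pos + c.false_pos
    return {
        "recall_rate": recalled / total,
        "cancer_detection_rate_per_1000": 1000 * c.true_pos / total,
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        "ppv1": c.true_pos / recalled,  # cancers among abnormal screening reads
    }


# Illustrative counts only (not data from the abstract).
example = AuditCounts(true_pos=5, false_pos=95, true_neg=898, false_neg=2)
metrics = audit_metrics(example)
```

A resident's running counts could be passed through the same function after each session so that the trajectory toward benchmark values can be tracked.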

Results Within the context of the assessments developed in the simulation (modified medical audit,

lesion type assessment, breast tissue density, and gamification), the results will be correlated with

targets for each resident level (first, second, and third rotations) and with how residents’ in-training medical audits

compare to the trajectory toward national benchmarks. The residents will be serially debriefed with

formative and summative results, and as data is collected peer-level comparisons will be

provided.

Conclusions Thoughtful development of objective assessment measures in a simulation design with feedback

is important, in order to provide users with a meaningful simulation experience.

Optimality of tool selection in radiologists and naïve subjects

Lisa M. Heisterberg, BS & Andrew B. Leber, PhD

The Ohio State University, Department of Psychology & Medical Scientist Training Program

Rationale While multiple avenues of research have investigated errors in radiology, one possible cause of errors that has received limited study is the way in which radiologists interact with Picture Archiving and Communication System (PACS) software. PACS software contains critical elements that allow radiologists to view and manipulate medical images, but its many tools and features can put individuals at risk of making sub-optimal choices. The difficulty of recruiting radiologist subjects and the high complexity of PACS software can make researching this topic difficult. Hence, we have developed a simple laboratory-based visual search task that approximates windowing, a PACS feature that allows for contrast enhancement of images. We sought to determine whether non-expert performance in our task could inform us about radiologist performance, and whether subjects would approach our task optimally.

Methods 26 radiologists and 26 subjects naïve to radiological image interpretation completed our study. Each trial tasked subjects with deciding whether a letter T was present or absent in displays containing distractor Ls. One of three classes of displays was shown on each trial. For the 80% of T-present trials, the T was not always immediately visible. Each display was initially shown with a default contrast adjustment setting applied that revealed the target on 5% of trials. Subjects could select from 3 additional adjustment settings: an optimal setting that revealed the target on 75% of trials, and 2 other settings that each displayed the target on 10% of trials. Subjects were pre-informed which adjustment setting was optimal for each display class. Selecting the optimal setting first, then selecting additional settings if the target was not found, would allow for the most accurate and efficient search.
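
The efficiency argument can be made concrete with a short calculation. The sketch below assumes that, on a target-present trial, exactly one setting reveals the target (the stated percentages sum to 100%) and that a subject keeps selecting settings until the target appears; both are assumptions for illustration, and the setting names are hypothetical.

```python
from itertools import permutations

# Reveal probabilities from the Methods; treating them as mutually
# exclusive outcomes on a target-present trial is an assumption.
P_DEFAULT = 0.05
P_REVEAL = {"optimal": 0.75, "other_a": 0.10, "other_b": 0.10}


def expected_selections(order):
    """Expected number of additional settings selected before the target is revealed."""
    return sum(P_REVEAL[name] * (position + 1) for position, name in enumerate(order))


orders = list(permutations(P_REVEAL))
optimal_first = expected_selections(("optimal", "other_a", "other_b"))
optimal_last = expected_selections(("other_a", "other_b", "optimal"))
# Average over all six orderings, i.e. choosing settings in random order.
random_order = sum(expected_selections(o) for o in orders) / len(orders)
```

Under these assumptions, selecting the optimal setting first needs about 1.25 additional selections per target-present trial, compared with about 1.9 for a random order and about 2.55 when the optimal setting is tried last.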

Results Accuracy for reporting the absence or presence of a target was not significantly different for radiologists (85.3%) and naïve subjects (83.5%). Naïve subjects spent significantly less time per trial (13.0s) than radiologists (16.5s). The percentage of the time the optimal adjustment setting was selected first was not significantly different between radiologists (79.6%) and naïve subjects (86.7%). For both radiologists and naïve subjects, those who more often selected the optimal setting first had significantly faster average trial completion times, with no differences in accuracy. Lastly, on target-present trials where the target was not visible with the optimal setting, subjects were significantly more likely to decide a target was absent, indicating that if a target was not visible using the optimal setting, subjects often neglected selecting other settings or continuing their search.

Conclusions These results demonstrate that in our simplified search task, radiologists are not completely accurate, are more efficient when making optimal choices, and can display sub-optimal behaviors, all of which are similar to naïve subjects. Such results reveal that the performance of non-radiologist subjects can inform us about radiologist performance in our task. Future studies will correlate radiologist performance in this simplified task with their performance using professional PACS. Overall, we hope to understand how radiologists interact with PACS software, why they may act sub-optimally, and how sub-optimal behaviors can be reduced.

Reducing Errors in Pathology Image-

based Decisions through Maximum

Confidence Slating

Jennifer S. Trueblood1, PhD, William R. Holmes2, PhD, Adam C. Seegmiller3, MD, PhD,

Charles Stratton3, MD, Quentin Eichbaum3, MD, PhD

1Department of Psychology, Vanderbilt University, 2Department of Physics and Astronomy,

Vanderbilt University, 3Department of Pathology, Microbiology and Immunology, Vanderbilt

University Medical Center

Rationale Second opinions can significantly improve diagnostic accuracy. However, multiple readings by

different individuals are not always feasible due to shortages in pathology and laboratory medicine

workforce, particularly in low resource settings. We examine whether it is possible to reduce errors

by having the same person perform multiple readings. Research in decision-making has shown a

“wisdom of the crowd within” effect, improving accuracy by aggregating responses from a single

individual. We apply a similar strategy to decisions about the pathology images.

Methods In two experiments, participants (novices in Exp 1 and experts in Exp 2) viewed images of white

blood cells and decided whether each image contained a blast cell (pathological white blood cell). On each

trial, participants were asked to make a binary choice followed by a confidence rating, indicating

their confidence in their decision. Participants viewed each image twice.

Results Results showed confidence was greater for correct as compared to incorrect responses (Exp 1: F(1,

36) = 106.83, p < .001 and Exp 2: F(1, 21) = 33.77, p < .001). We then applied a maximum

confidence slating algorithm (MAX, Koriat, 2012) to each individual’s decisions. For each image,

MAX selects the trial with the higher confidence. We compared this approach with average

performance (AP) as well as a minimum confidence slating (MIN) algorithm that selects the lower

confidence response for each image. We found a main effect of algorithm (Exp 1: F(2, 72) = 29.41,

p < .001 and Exp 2: F(2, 42) = 12.21, p < .001), with post hoc tests showing that MIN generated

lower accuracy than AP, and AP generated lower accuracy than MAX.
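
The three aggregation rules compared above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the data are made-up toy readings, the function names are hypothetical, and keeping the first reading when the two confidence ratings tie is an assumption, since the abstract does not say how ties were handled.

```python
from typing import Callable, List, Tuple

# One reading of an image: (was the decision correct?, confidence rating)
Reading = Tuple[bool, float]
Pair = Tuple[Reading, Reading]  # the two readings of the same image


def max_confidence(a: Reading, b: Reading) -> Reading:
    """MAX: keep the reading with the higher confidence (ties keep the first)."""
    return a if a[1] >= b[1] else b


def min_confidence(a: Reading, b: Reading) -> Reading:
    """MIN: keep the reading with the lower confidence (ties keep the first)."""
    return a if a[1] <= b[1] else b


def slated_accuracy(pairs: List[Pair], pick: Callable[[Reading, Reading], Reading]) -> float:
    """Accuracy when `pick` chooses one of the two readings for every image."""
    chosen = [pick(first, second) for first, second in pairs]
    return sum(correct for correct, _ in chosen) / len(chosen)


def average_performance(pairs: List[Pair]) -> float:
    """AP: accuracy over all readings, ignoring confidence."""
    readings = [r for pair in pairs for r in pair]
    return sum(correct for correct, _ in readings) / len(readings)


# Hypothetical toy data: four images, each read twice.
toy_pairs: List[Pair] = [
    ((True, 5), (False, 2)),
    ((False, 3), (True, 4)),
    ((True, 4), (True, 2)),
    ((False, 4), (True, 1)),
]
acc_max = slated_accuracy(toy_pairs, max_confidence)
acc_min = slated_accuracy(toy_pairs, min_confidence)
acc_ap = average_performance(toy_pairs)
```

In this toy set the ordering MIN < AP < MAX matches the pattern reported above, but only because confidence was constructed to track correctness in most pairs.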

Conclusions In sum, our results show that confidence is associated with accuracy in pathology decisions and

suggest that confidence can be used as a way to aggregate multiple readings within the same individual.

A novel learning-based paradigm to

investigate the visual-cognitive bases of

lung nodule detection

Frank Tong1,2, Ph.D., Malerie G. McDowell1, B.A., William R. Winter3, M.D., M.S.,

and Edwin F. Donnelly3, M.D., Ph.D.

1 Psychology Department, Vanderbilt University

2 Vanderbilt Vision Research Center, Vanderbilt University

3 Department of Radiology, Vanderbilt University Medical School

Rationale: Even expert radiologists will sometimes fail to detect the presence of a pulmonary

nodule in a chest X-ray image, with estimated rates of missed detection of 20-30%.

The challenging nature of this diagnostic task lies not only in the visual contrast or

the size of the nodule, but also in the heterogeneity of nodule appearance and the

variability of the local anatomical background. The goal of our study was to

develop a learning-based paradigm, using image processing software to generate a

large, heterogeneous set of visually realistic simulated nodules, to gain insight into

the visual and cognitive bases of lung nodule detection.

Methods: The current version of our software allows for the creation of simulated nodules

with heterogeneous appearance, allowing for rigorous control over the size, shape,

brightness, contrast, and placement of nodules in 2D chest radiographs.

Results: At the MIP Lab at RSNA, we tested radiologist participants (n=10) with both real

and computer-simulated nodules at a challenging nodule localization task.

Performance accuracy was significantly better for real nodules than for the subtle

simulated nodules we created (70.5% vs. 59.0% accuracy, p < 0.005). Of greater

interest, radiologists performed no better than chance level at discriminating

whether nodules were real or simulated (mean accuracy 52.9%). Next, we

evaluated the impact of training naive undergraduate participants at a localization

task involving simulated nodules. Participants underwent 3-4 training sessions and

viewed a total of 600 simulated cases. We observed significant improvements

following training for both simulated nodules (30.3% accuracy pre-test, 78.2%

accuracy post-test, p < 0.00001) and real nodules (37.5% pre-test, 62.5% accuracy

post-test, p < 0.0005). In our next experiment, we investigated whether extended

training with either light or dark polarity nodules would lead to polarity-specific

training benefits in initially naive undergraduates. This indeed proved to be the

case, implying that this training regimen led to the learning of a polarity-specific

perceptual template of nodule appearance. Finally, we conducted an exploratory

pilot study with 6 radiology residents to see whether they might show performance

improvements following training with our nodule localization task. The results of

this initial pilot revealed a highly significant improvement in performance with

simulated nodules on the final test day, and a non-significant trend of improvement

for real nodule cases.

Conclusions: Taken together, our results demonstrate that marked improvements in nodule

detection can be achieved by implementing a training regimen with numerous

realistic examples, and moreover, that trained undergraduates can serve as useful

model observers for investigating the visual-cognitive bases of nodule detection.

With continued refinement of our simulation methods and training set of images,

we anticipate that it should be possible to further boost generalization of these

training benefits to real nodule test cases. Future developments of this nodule

localization training paradigm could prove useful as a software tool for enhancing

the diagnostic training of radiology residents.

The Importance of Peripheral Visual

Processing and Eye Movements in Search

with 3D Images

Miguel P. Eckstein, Miguel A. Lago, Craig K. Abbey

Department of Psychological & Brain Sciences, UC Santa Barbara, Santa Barbara, CA. 93106, USA

RATIONALE When radiologists use a 3D imaging modality to diagnose a disease they often read the data as a

stack of 2D slices and scroll through the slices. The foveated nature of the human visual system

and the typical reading times prevent radiologists from exhaustively exploring all regions of the

image set with their high-resolution fovea. Thus, radiologists must rely on vision away from the

fovea (the visual periphery) to process many regions of the images. Here, we investigate how

target detectability varies with retinal eccentricity and explore eye movement patterns and

detection accuracy during 3D search.

METHODS We measured the detectability of small and large targets briefly presented at a

known location (50% probability) in filtered noise and digital breast tomosynthesis phantoms.

Eye position monitoring allowed us to ensure that observers maintained gaze on a fixation point.

In a separate study, observers searched for the large and small targets (50% probability of target

presence) in 3D volumetric images with the two backgrounds. Observers were given unlimited

time to scroll and search. We measured search accuracy (true positive rate, false positive rate),

eye movements and scrolls.
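
One common way to turn a true positive rate and a false positive rate into a single detectability index is the equal-variance Gaussian d' for yes/no tasks. The abstract does not state how detectability was computed, so the sketch below is an assumption for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist


def d_prime(true_positive_rate: float, false_positive_rate: float) -> float:
    """Equal-variance Gaussian detectability: z(TPR) - z(FPR).

    Both rates must lie strictly between 0 and 1, otherwise the inverse
    normal CDF is undefined and a StatisticsError is raised.
    """
    z = NormalDist().inv_cdf
    return z(true_positive_rate) - z(false_positive_rate)


# Illustrative rates only (not data from the abstract).
chance = d_prime(0.5, 0.5)
moderate = d_prime(0.8413, 0.1587)
```

Observers with perfect hit rates or no false alarms need a correction before this formula can be applied, which is a design decision the sketch leaves open.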

RESULTS The results show strong dissociations on detectability in the visual periphery across large and

small targets for both synthetic textures and DBT phantoms. Detectability for the small target

degraded abruptly in the visual periphery while that of the larger target reduced more

moderately. For the 3D search, participants were unable to explore significant portions of the

data with fixational eye movements suggesting that they relied on peripheral processing for their

decisions. We found that 3D search led to a significant reduction in target detectability of the

small targets. We found large variability in human performance detecting the small targets in 3D

search. Individual detectabilities for the small signals in 3D search were related to the observers’

eye movements: observers’ search accuracies were inversely correlated with the average closest

distance of the observers’ fovea to the signal. Detectability did not correlate with search times.

CONCLUSION For 3D imaging modalities, the properties of the human visual periphery and eye movements are

critical in determining the detectability of searched targets and might play an important role

in determining individual variability in search accuracy.

Foveated Model Observers applied to DBT

image phantoms

Miguel A. Lago1, Bruno B. Barufaldi2, Predrag R. Bakic2, Craig K. Abbey1, Susan P. Weinstein2, Brian

Englander2, Andrew D. Maidment2, Miguel P. Eckstein1

1Department of Psychological & Brain Sciences, UC Santa Barbara, Santa Barbara, CA, USA

2Department of Radiology, University of Pennsylvania, Philadelphia, PA, USA

RATIONALE Digital Breast Tomosynthesis (DBT) is becoming the standard for breast imaging. New 3D imaging

modalities bring large volumes of data that cannot be exhaustively explored with eye movement fixations.

Thus, radiologists read the images with visual processing away from the fovea (visual periphery).

However, accuracy in the visual periphery is degraded relative to foveal processing. Current model

observers (Channelized Hotelling and Non-Prewhitening Matched Filter with an Eye Filter) only model

visual processing at the fovea and might not be sufficient to account for human performance with 3D

imaging modalities. Here, we propose a new Foveated Channelized Hotelling Observer (FCHO) that

incorporates vision across the entire visual field with reduced spatial detail away from the fovea. We

compare the FCHO model to traditional non-foveated model observers in their ability to predict human

performance when searching for simulated microcalcifications and masses in breast phantoms.

METHODS We designed an experiment consisting of a free search of a simulated signal (microcalcification or mass)

within the UPENN DBT phantom. We ran 12 radiologists in 28 trials of a 3D DBT (64 slices) search and

28 trials of a single slice 2D DBT search with 50% signal presence. Additionally, we trained our FCHO

to search for the simulated signals within the breast phantoms. Our model analyzes the visual field with

different templates at different distances from the fixation point. It also includes an eye movement model

that selects the next fixation point and a scrolling model that goes through the slices of the volumetric

image. We compared human and model accuracy for a single (central) slice of DBT and the complete 3D

DBT.

RESULTS Human observer performance shows a significantly lower detectability for the microcalcification in 3D

search (d'=1.36±0.33) compared with 2D search (d'=3.78±0.54), while masses do not show a significant

difference (2D: d'=1.89±0.38; 3D: d'=1.65±0.40). The FCHO model performance agrees

with these results and captures the interaction in detectability between the signal type

(microcalcification vs. mass) and the search task (2D vs. 3D), consistent with previous results

for our FCHO model on images with correlated Gaussian noise.

CONCLUSION We presented the first application of the FCHO to more realistic images containing structures, different

tissues, and backgrounds that are more complex. The FCHO model correctly predicted the dissociation

in results across signal types while traditional model observers did not. The results motivate the use of

foveation in model observers in order to assess image quality in 3D DBTs.

The Role of Comparison in Categorization

Learning of Chest X-ray: An Eye

Movement Study

Yanju Ren, Ph.D.1; Yuanjie Zheng, Ph.D.2

1School of Psychology, 2School of Information Science and Engineering, Shandong Normal

University, Jinan, P. R. China

Rationale: The chest X-ray is one of the most commonly accessible radiological examinations for

screening and diagnosis of many lung diseases. How a novice observer learns to classify a

chest X-ray as normal or abnormal (and, further, as benign or malignant) is an important

research topic in the domain of medical education. Comparison is one of the key

processes by which people learn and has been broadly found to be very effective in, for example,

category learning. The present study therefore explores the role of comparison

in chest X-ray categorization using eye tracking technology.

Methods: To this end, two eye movement experiments were conducted. Experiment 1 consisted of three

phases. In the pretest phase, forty-eight undergraduate students performed the chest X-ray

categorization (normal or abnormal) task to establish baseline performance. In the second

phase, half of the participants (24 of 48) were assigned to the comparative learning condition

and the other half were assigned to the non-comparative learning condition. In the

test phase, the participants performed the categorization task. Experiment 2 also contained

three similar phases, but only the comparison task was employed and the presentation time of the chest

X-ray was manipulated.

Results: In Experiment 1, the participants categorized the chest X-rays at chance level in the pretest

phase. After comparison learning, the comparative learning group showed a marked

improvement on the chest X-ray categorization task in the test phase, reflected in shorter

reaction times, fewer saccades, shorter fixation durations, and longer saccade amplitudes.

In Experiment 2, the participants in the long presentation time group achieved better categorization

performance than those in the short presentation time group.

Conclusions: The two eye tracking experiments demonstrate that learning style and the presentation time of

chest X-rays play important roles in medical image categorization.

Investigating Observer Gaze Patterns on Facial

Disfigurements from Head and Neck Cancer

Krista M. Nicklaus, MSE1,2, Enrique Callado3, Joowon Cho, MSE3, Jun Liu, PhD2, Mary Catherine Bordes,

BS2, Gregory P. Reece, MD2, Summer E. Hanson, MD, PhD2, Jeffery M. Engelmann, PhD4,

Mia K. Markey, PhD1,5

1Biomedical Engineering, The University of Texas at Austin, 2Plastic Surgery, The University of Texas MD

Anderson Cancer Center, 3Electrical Engineering, The University of Texas at Austin, 4Psychiatry and

Behavioral Medicine, Medical College of Wisconsin, 5Imaging Physics, The University of Texas MD Anderson

Cancer Center

Rationale

Facial disfigurement resulting from head and neck cancer can have devastating effects on psychosocial

functioning. Body image changes experienced by head and neck cancer patients contribute to high levels of

depression and anxiety, social isolation, impaired quality of life, and sexual difficulties. Many head and neck

cancer patients feel discounted or stigmatized, are preoccupied by appearance changes, or avoid social

situations due to changes in appearance and functioning. Feeling stigmatized can arise from concerns about how

others will react to one’s appearance as well as how others actually react. Individuals with facial disfigurement

are highly aware of how others behave toward them (e.g., staring, gaze aversion, unwelcomed comments) and

facial expressions perceived to convey negative emotional reactions (i.e., disgust). Our long-term goal is to

support head and neck cancer patients in developing realistic expectations about how other people in non-social

group settings (e.g., customers in a grocery store) will respond to their appearance by presenting selected

information from a normative database of the responses of lay observers to facial disfigurement resulting from

head and neck cancer. While several methodologies could be used to study observers’ cognitive, behavioral, and

emotional responses to facial disfigurement, eye tracking has the potential to help us understand behavioral

responses such as staring and gaze aversion. This preliminary study examines lay observers’ gaze patterns when

looking at clinical photographs of head and neck cancer patients.

Methods

Eye movements were recorded and tracked with a Tobii TX300 Eye tracker (Tobii Technology Inc., Falls

Church, VA), with a sampling rate of 300 Hz. 20 lay observers viewed 144 face images for 6 seconds each (4

images were used for practice). The images are from 35 head and neck cancer patients with varying degrees of

disfigurement over multiple time points from 1 to 12 months post face reconstruction. Two clinical experts

determined whether the facial disfigurement was on the left or right side of the face, or undetermined. The

midline of the face was defined by the line from the central hairline, through the pronasale, to the lowest point

of the chin. Gaze fixations and saccades were mapped using the EyeMMV toolbox in MATLAB (MathWorks,

Natick, MA, USA). Fixation location, dwell time, and saccades were investigated in relation to the location of

disfigurement.
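
For readers unfamiliar with how fixations are extracted from raw gaze samples, the following is a minimal sketch of a generic dispersion-threshold (I-DT style) detector. It is not the EyeMMV algorithm and is not the analysis used in this study; the function names, the 100 ms minimum duration, and the 30-pixel dispersion limit are hypothetical choices for illustration, with only the 300 Hz sampling rate taken from the Methods.

```python
from typing import List, Tuple

Sample = Tuple[float, float]             # (x, y) gaze position in pixels
Fixation = Tuple[float, float, float]    # (centroid x, centroid y, duration in s)


def dispersion(window: List[Sample]) -> float:
    """Horizontal extent plus vertical extent of the samples in the window."""
    xs = [p[0] for p in window]
    ys = [p[1] for p in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples: List[Sample], rate_hz: float = 300.0,
                     min_duration_s: float = 0.1,
                     max_dispersion_px: float = 30.0) -> List[Fixation]:
    """Group consecutive samples into fixations using a dispersion threshold."""
    min_len = max(1, int(round(min_duration_s * rate_hz)))
    fixations: List[Fixation] = []
    start, n = 0, len(samples)
    while start + min_len <= n:
        end = start + min_len
        if dispersion(samples[start:end]) <= max_dispersion_px:
            # Grow the window while the samples stay tightly clustered.
            while end < n and dispersion(samples[start:end + 1]) <= max_dispersion_px:
                end += 1
            window = samples[start:end]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy, len(window) / rate_hz))
            start = end
        else:
            start += 1  # Not a fixation yet; slide the window by one sample.
    return fixations


# Synthetic trace: a 0.2 s fixation, a short saccade, then another 0.2 s fixation.
trace = ([(100.0, 100.0)] * 60
         + [(160.0, 140.0), (220.0, 180.0), (280.0, 220.0), (340.0, 260.0)]
         + [(400.0, 300.0)] * 60)
found = detect_fixations(trace)
```

Each detected centroid could then be compared with the facial midline to label the fixation as falling on the disfigured or non-disfigured side.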

Results

The locations of fixations, duration of fixations, and saccades were mapped to each image. There was

substantial variation in the gaze patterns across observers and stimuli.

Conclusions

Eye tracking data has the potential to identify features of disfigured faces that attract attention of lay observers.

However, substantial inter-observer variability suggests that future work is needed to investigate factors that

influence lay people’s cognitive, behavioral, and emotional responses to facial disfigurement. Factors to

consider include quantitative measures of facial disfigurement; the lay person’s body image and affective state;

and the layperson’s demographic variables.

Influence of radiology expertise on the perception of nonmedical images

Brendan Kelly, Louise A. Rainford, Mark F. McEntee, Eoin C. Kavanagh

Abstract Whether participants with differing diagnostic accuracy and visual search behavior during radiologic tasks also differ in nonradiologic tasks is investigated. Four clinician groups with different radiologic experience were used: a reference expert group of five consultant radiologists, four radiology registrars, five senior house officers, and six interns. Each of the four clinician groups is known to have significantly different performance in the identification of pneumothoraces in chest x-ray. Each of the 20 participants was shown 6 nonradiologic images (3 maps and 3 sets of geometric shapes) and was asked to perform search tasks. Eye movements were recorded with a Tobii TX300 (Tobii Technology, Stockholm, Sweden) eye tracker. Four eye-tracking metrics were analyzed. Variables were compared to identify any differences among the groups. All data were compared by using nonparametric methods of analysis. The average number of targets identified in the maps did not change among groups [mean = 5.8 of 6 targets (range 5.6 to 6, p = 0.861)]. None of the four eye-tracking metrics investigated varied with experience in either search task (p > 0.5). Despite clear differences in radiologic experience, these clinician groups showed no difference in nonradiologic search pattern behavior or skill across complex images. This is another viewpoint adding to the evidence that radiologic image interpretation is a learned skill and is task specific.

Variations in Lung Nodule Detection and Functional Visual Field of Radiologists

Geoffrey D. Rubin, MD, MBA, Brian Harrawood, Kingshuk RoyChoudhury, PhD,

Justus E. Roos, MD, Martin Tall, Sandy Napel, PhD Departments of Radiology, Duke University and Stanford University

Rationale The foveal gaze of radiologists is exposed on average to only 27% of the lung volume, yet 76% of imbedded lung nodules are included in that volume. This suggests that radiologists’ functional visual field (FVF) for lung nodule detection in CT scans extends well beyond the limits of central gaze (within a 5° gaze angle). To better characterize radiologists’ FVF while dynamically scrolling through CT scans, we measured the distance between radiologists’ gaze point and a lung nodule immediately prior to its formal detection at the “moment of recognition”.

Methods Time-varying gaze traces acquired from 13 radiologists using unconstrained stacked transverse section paging and eye tracking during the interpretation of 40 chest CT scans enriched with 157 simulated 5-mm solid lung nodules were subdivided into periods of nodule visibility (exposures). “Gaze distances” were measured between gaze points and lung nodules to quantify their relationship with nodule exposure duration and detection. The moment of recognition (MoR) was defined as the time point immediately preceding the saccade that converged upon and resulted in the immediate detection of a nodule. MoR distances were measured and characterized as central (foveal) versus peripheral vision based upon a 5° gaze angle threshold.
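
The central versus peripheral classification can be sketched as a simple distance threshold. The 90-pixel radius for a 5° gaze angle is the value given in the figure caption of this abstract for the illustrated subject; applying it as a fixed constant, measuring distance only in the image plane, and the function names are assumptions made for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) position on the displayed CT section, in pixels

# Radius of central gaze (5 degree gaze angle) for the subject in the figure.
CENTRAL_RADIUS_PX = 90.0


def gaze_distance(gaze: Point, nodule: Point) -> float:
    """In-plane Euclidean distance between the gaze point and the nodule centre."""
    return math.hypot(gaze[0] - nodule[0], gaze[1] - nodule[1])


def classify_mor(distance_px: float, radius_px: float = CENTRAL_RADIUS_PX) -> str:
    """Label a moment-of-recognition distance as central (foveal) or peripheral."""
    return "central" if distance_px <= radius_px else "peripheral"


# Distances quoted in the figure caption for time points 1, 4, and 5.
labels = [classify_mor(d) for d in (353.0, 164.0, 50.0)]
```

Because the pixel radius depends on viewing distance and display pixel size, it would need to be recomputed for each reader's setup rather than reused across subjects.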

Results There were 9,751 nodule exposures, defined as discrete periods of nodule visibility exclusive of those following a detection, that consumed 6% of the total search time. 3,371 of these exposures resulted in the detection of 997 TP nodules (49% detection rate). The duration of exposure to undetected (false negative) nodules was 3.5 times longer than TP nodules (p

Figure: (A) Free longitudinal (z) search path from a single reader examining a chest CT scan with three 5-mm lung nodules centered on the orange lines and visible over the faded orange bands. Red regions indicate periods when nodules were displayed on visualized cross-sections but were not detected. The green region indicates the period when one of the three nodules was detected by the reader. The three nodules were visible for 3.1, 3.3, and 3.4% of the search time and were exposed to central gaze for 0.2, 0.0, and 0.1%, respectively, across the 357 second search duration. (B) The 3.5 second region contained within the green zone in (A) is magnified and displayed with the corresponding gaze point samples. Vertical gray zones indicate regions where the target lung nodule is not visible at the extremes of the slab through which the subject scrolls. Selected time points (1-6) are illustrated with corresponding CT section, gaze point (red circle with 50-pixel diameter), target (orange circle), and acceptance of the detection (green circle). The subject is positioned such that central gaze (5° gaze angle) is within 90 pixels of the gaze point. At the beginning of the trace, the nodule is not visible, but the subject scrolls down and the nodule is revealed when the gaze is 353 pixels away (1). The gaze then deviates closer to the x, y position of the nodule (2), but moves back to the posterior lung (3). Following a saccade, the gaze shifts anteriorly to within 164 pixels of the nodule (4). Another saccade ensues bringing the gaze within 50 pixels of the nodule, just as the viewer scrolls beyond the nodule, reverses scroll direction and lands on the nodule (5). After 1 second scrutinizing the nodule, it is accepted (6). 
Based upon the location of the final saccade converging on the nodule, the moment of recognition is classified to occur at the dotted black line and the preceding time period is considered to be search while the subsequent time period is considered to be decision making.

Visual search behavior reveals differences

in diagnostic accuracy based on

experience

Joe Thomas, BSc1, Bradley Fawver, PhD1, Megan Mills, MD2, William Auffermann, MD, PhD2,

Trafton Drew, PhD3, and A. Mark Williams, PhD1

1University of Utah; Department of Health, Kinesiology & Recreation

2University of Utah; Department of Radiology and Imaging Sciences

3University of Utah; Department of Psychology

Rationale

A substantial number of medical errors in radiology are attributed to failures of perception or

failures of decision making. Although it is believed that experience in diagnostic imaging

naturally leads to the development of expertise, data from other medical fields suggests this may

not be the case. The purpose of this study was to explore how diagnostic accuracy differs across

radiology professionals as a function of experience, as well as ascertain the extent to which

changes in visual search behaviors underlie improved diagnostic outcomes.

Methods

Twenty radiologists (5 Attending, 5 Fellows, 10 Residents) dictated their findings on 10

musculoskeletal cases (negative and abnormal cases included) obtained from a medical database.

Mobile eye-tracking glasses sampled gaze behavior at 120 Hz, while Likert-scale measures of

mental effort and confidence were obtained after each case. Key areas of interest (i.e., where the

abnormality was located) were identified on each abnormal case, and two radiologists coded

accuracy. Simple linear regressions were utilized to explore relationships between experience

(i.e., resident, fellow, attending physician), diagnostic outcomes (e.g., trial time, accuracy), and

attentional processes (e.g., fixation, saccadic behavior).

Results

Participants demonstrated an 89% accuracy rate on negative cases and a 67% accuracy rate on

abnormal cases, so analyses proceeded exclusively on abnormal cases. Attending physicians

exhibited only marginally improved diagnostic accuracy on abnormal cases (67%) compared to

individuals in the resident program (61%). Level of experience was associated with reduced trial

time (p < .001) and increased confidence in the diagnosis (p = .004). More experienced

individuals demonstrated fewer fixations (p = .001) of shorter duration (p = .003) on the dictation

screen, fewer fixations on the medical images (p < .001), and fewer fixations on key areas of

interest (p = .002). Experience was also associated with increased saccadic amplitude (p = .007)

and decreased peak saccadic velocity (p < .001). After controlling for experience, the total

number (p = .001), duration (p = .015), and percentage of fixations (p = .004) on key areas of

interest were associated with improved diagnostic accuracy.

Conclusion

As expected, experienced radiologists spent less time diagnosing each case and were more

confident in their diagnosis. Experience was also associated with more purposeful visual search

behavior on the images and more efficient use of medical imaging technology. However, while

time spent viewing information-rich areas of the medical images (i.e., the abnormality) was

positively associated with diagnostic accuracy, it was negatively associated with experience.

Findings suggest a physician’s confidence in their diagnosis might be misplaced when cases are

dictated too quickly or when individuals spend insufficient time extracting relevant information

from key areas of the visual display.

Impact of expertise on reading mammograms: An eye-tracking study

Lucie Lévêque1,2 (MSc), Hilde Bosmans3 (PhD), Lesley Cockmartin3 (PhD), Hantao Liu2 (PhD)

1School of Computer Science and Informatics, Cardiff University, United Kingdom

2Department of Computer Science and Software Engineering, Xi’an Jiaotong-Liverpool University, China

3Department of Radiology, University Hospitals KU Leuven, Belgium

Rationale

Breast cancer screening uses low-dose x-rays to detect cancers early, and thus to allow more efficient treatment. It is critical to understand how medical professionals perceive and interpret mammograms with a view to reducing errors in screening mammography. Various eye-tracking studies have been undertaken in this area, presenting different experimental designs (e.g., films vs. digital mammograms, public databases vs. selected cases). A prominent topic in the literature is the comparison between experienced and less experienced readers.

Methods

An eye-tracking experiment was conducted with several expert radiologists, trainee radiologists, and physicists, who were asked to read 196 medio-lateral oblique (MLO) mammogram views from 98 patients. The cases were free of lesions, but the readers were not informed about this fact. After reading both left and right images of a case, the participants had to answer the following question: “refer or not refer?” by focusing their gaze on one of these options on the screen. The eye movements of the participants were recorded using a non-invasive SMI Red-m eye-tracking system.

Results

Gaze information was extracted from the raw eye-tracking data obtained during the experiment, including the number of fixations per stimulus, their coordinates and duration. An analysis of variance (ANOVA) was used to study the similarity between the three expert radiologists in terms of mean fixation duration. Results show no statistically significant difference between the three expert radiologists (i.e., p

Fig. 1: Illustration of the mean fixation duration of expert radiologists R1, R2 and R3 (in red), trainee radiologists T1, T2 and T3 (in green), and physicists P1 and P2 (in blue), averaged over all fixations recorded for all test stimuli. Error bars indicate a 95% confidence interval.

Saliency maps, i.e., topographic representations indicating the conspicuousness of scene locations, were created using the fixations obtained from the eye-tracking experiment. Each fixation location gave rise to a greyscale patch simulating the foveal vision of the human visual system. In a saliency map, salient regions represent where the observers focused their gaze most frequently. The maps show that expert and trainee radiologists’ gaze patterns are concentrated, whereas physicists’ gaze patterns are more distributed over the mammogram.
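The map construction described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Gaussian patch shape, the `sigma_px` parameter, the rescaling to [0, 1], and the function name `saliency_map` are all assumptions introduced here, and the fixation coordinates are invented.

```python
import math

def saliency_map(fixations, width, height, sigma_px):
    """Sum one Gaussian patch per fixation and rescale the map to [0, 1].

    fixations: iterable of (x, y) pixel coordinates.
    sigma_px:  assumed patch spread in pixels (hypothetical parameter; in
               practice it would be chosen to approximate the foveal extent).
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma_px ** 2))
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[v / peak for v in row] for row in grid]
    return grid

# Hypothetical fixations: two close together, one isolated.
demo = saliency_map([(10, 10), (11, 10), (30, 20)], width=40, height=30, sigma_px=3.0)
```

Locations that attract several nearby fixations accumulate overlapping patches and end up brighter than a location fixated once, which is why concentrated gaze patterns yield compact bright regions and dispersed patterns yield a flatter map.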

Conclusions

An eye-tracking experiment was designed and conducted to study the impact of medical specialty and level of experience on perceptual behaviour while interpreting mammograms. Results showed that physicists have, in general, a higher dwell time than experts, whereas trainees have a lower dwell time. Furthermore, the physicists’ gaze patterns were more dispersed than those of the radiologists, whereas the trainees showed patterns similar to those of the radiologists.

The strength of the gist of the abnormal in the

unilateral and bilateral mammograms

Ziba Gandomkar*a, Ernest U. Ekpoa, Sarah J. Lewisa, Karla K. Evansb, Kriscia Tapiaa, Tong Lia, Seyedamir Tavakoli Tabaa, Jeremy M. Wolfec, Patrick C. Brennana

a Medical Imaging Sciences, Faculty of Health Sciences, University of Sydney, Sydney, NSW, Australia; BreastScreen Reader Assessment Strategy (BREAST), University of Sydney, Sydney, NSW, Australia.

b Department of Psychology, University of York, Heslington, York, UK.

c Visual Attention Lab, Harvard Medical School, Cambridge, MA, USA.

Rationale Experts can perceive the gist of the abnormal in the negative prior unilateral mammograms of women who were subsequently

diagnosed with breast cancer. Here, we compared the strength of the gist from unilateral and bilateral mammograms.

Methods Seventeen radiologists viewed 60 cases in two different experiments (GistUnilateral and GistBilateral). In GistUnilateral, 60

unilateral craniocaudal mammograms were presented in a randomly generated sequence for a half-second to the

radiologists, who were asked to provide an abnormality probability for each case on a scale from 0 (confident normal)

to 100 (confident abnormal). In GistBilateral, we presented bilateral mammograms of the same cases using a similar

experimental protocol. Readers were randomly assigned to two groups: the first group did the unilateral experiment first, while

the second group did the bilateral experiment first. Four categories of mammograms (15 cases per category) were

included: 1) Cancer cases, which contained biopsy-proven malignancies; 2) Normal cases, which remained normal for at

least the next two years; 3) Prior_Vis cases, which contained retrospectively visible non-actionable cancer signs; 4)

Prior_Invis cases, which did not contain visible cancer signs. Mammograms from the last two groups were from women

who subsequently developed biopsy-proven malignancies. For each radiologist and each category, the Pearson

correlation between the unilateral and bilateral gist responses was calculated. In each experiment, three pair-wise

classifications, i.e. Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal were analysed. A paired, two-sided

Wilcoxon Signed Rank test was used to investigate whether the values of area under receiver operating characteristic

curves (AUC) were at an above-chance (AUC=0.5) level. The same test was also used to show whether the AUC values

from two experiments differed significantly for each pair-wise classification. For each radiologist and each case, we also

calculated the average of the two gist responses recorded in the two experiments and produced GistAVE, i.e.

½(GistUnilateral+GistBilateral).
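The per-reader analysis described above can be sketched as follows. This is an illustration only: the function names `auc` and `gist_ave` are introduced here, the ratings are invented, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which may differ from the software the authors used.

```python
def auc(abnormal_scores, normal_scores):
    """Area under the ROC curve as the probability that a randomly chosen
    abnormal case is rated higher than a randomly chosen normal case
    (ties count one half)."""
    wins = 0.0
    for a in abnormal_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(abnormal_scores) * len(normal_scores))

def gist_ave(unilateral, bilateral):
    """Per-case mean of the two ratings: 1/2 (GistUnilateral + GistBilateral)."""
    return [(u + b) / 2.0 for u, b in zip(unilateral, bilateral)]

# Hypothetical 0-100 ratings from one reader (not data from the study).
cancer_uni, cancer_bi = [80, 55, 70, 40], [60, 75, 35, 65]
normal_uni, normal_bi = [30, 50, 20, 45], [45, 25, 55, 30]

auc_uni = auc(cancer_uni, normal_uni)
auc_bi = auc(cancer_bi, normal_bi)
auc_ave = auc(gist_ave(cancer_uni, cancer_bi), gist_ave(normal_uni, normal_bi))
```

Repeating this for each reader gives one AUC per reader and per experiment. Those paired values are what a Wilcoxon signed-rank test (for example `scipy.stats.wilcoxon`) would then compare, either against 0.5 or between experiments.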

Results The average correlation coefficients across the 17 readers for Cancer, Normal, Prior_Vis, Prior_Invis, and all cases were

0.17 (CI=0.03-0.31), 0.26 (CI=0.09-0.43), 0.30 (CI=0.12-0.49), 0.35 (CI=0.21-0.49), and 0.35 (CI=0.25-0.44),

respectively. The order of median AUCs in Cancer/Normal and Prior_Vis/Normal classifications from the highest to the

lowest was GistAVE>GistUnilateral>GistBilateral. All differences except the difference for GistAVE and GistUnilateral in

Prior_Vis/Normal classification were significant. In Prior_Invis/Normal classification, the order was

GistAVE>GistBilateral>GistUnilateral. None of the differences in the AUC values for Prior_Invis/Normal classification were

significant. On average, the AUCs of Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal classifications based on

GistUnilateral respectively dropped by 8%±6%, 10%±8%, and 1%±8% in the bilateral experiment while these AUCs

increased by 5%±3%, 2%±4%, and 4%±6% after averaging two signals. On average, the AUCs of Cancer/Normal,

Prior_Vis/Normal, Prior_Invis/Normal classifications based on GistAVE were 82%±4%, 74%±3%, and 67%±5%.

Conclusions There is a weak association between the gist signals from unilateral and bilateral mammograms. The signal was stronger in

the unilateral experiment. When the two signals were averaged, the AUCs increased. The improvement could be a result

of cancelling out random noise by averaging two values. Further investigation of intra-reader variability, and of the AUC

obtained when a reader's unilateral gist responses are averaged across multiple experiments, is required.

Characterizing Image Features That Allow for Rapid Breast Cancer Detection Even Before Appearance of Visibly Actionable Lesions

Karla K. Evans1 & Jeremy M. Wolfe2,3

1Department of Psychology, University of York

2Department of Surgery, Brigham & Women's Hospital

3Department of Ophthalmology, Harvard Medical School

Rationale Expert radiologists can detect a “global gist signal” in mammograms, allowing them to distinguish normal from abnormal cases at above-chance levels even in mammograms acquired before the development of visible, actionable lesions (“priors”). In previous studies with filtered images, we found that the gist signal was strong in the high spatial frequencies, not in the low frequencies. In the present study, we seek to isolate more precisely the spatial frequency information that radiologists use to make successful gist decisions.

Methods Radiologists were presented with 120 bilateral mammograms. Half were completely normal and remained normal for at least four subsequent years. The other half were abnormal and were subdivided equally into three types: subtle cancers, obvious cancers, and mammograms acquired 3 years prior to the mammograms that showed visibly actionable cancer. Radiologists were asked to rate the abnormality of the images on a 0-100 scale after an exposure of 500 msec. We collected ratings on this set from 21 radiologists at different experience levels. They viewed the full set in three different conditions across 3 blocks: original mammograms without manipulation, mammograms retaining only spatial frequencies above 0.5 cycles per degree of visual angle (cpd), and mammograms retaining only spatial frequencies above 1 cpd. The order of the blocks was counterbalanced across participants.

Results Using the normal cases to estimate the false-positive rate in all cases, we can calculate d’ for each of the three types of abnormal case and for each filter condition. The results are shown in the table for all 21 observers and, in parentheses, for the 16 observers who read more than 1000 cases per year:

            Original      Freq > 0.5 cpd   Freq > 1.0 cpd
Subtle      .79 (.94)     .61 (.84)        .32 (.29)
Obvious     .85 (1.12)    .88 (1.10)       .62 (.62)
Priors      .05 (.18)     .67 (.88)        .05 (.04)
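For readers unfamiliar with the sensitivity index, the d’ values in the table follow the standard signal-detection definition, d’ = z(hit rate) − z(false-alarm rate). The sketch below shows that definition only. The abstract does not state how the 0-100 ratings were converted into hits and false alarms, so the rates are taken as given, and the function name `d_prime` and the example rates are hypothetical.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1; NormalDist.inv_cdf
    raises an error at exactly 0 or 1.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical rates, not taken from the study.
example = d_prime(0.69, 0.31)
chance = d_prime(0.50, 0.50)  # equal hit and false-alarm rates give d' = 0
```

A d’ near zero, as for the unfiltered Priors, means that abnormal cases were rated abnormal about as often as normal cases were.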

The most interesting finding is that performance improves for Priors when frequencies below 0.5 cpd are filtered out (F(1,15)=23.01, p

The “Gist” in Prostate Volumetric

Imaging

Melissa Treviño, Ph.D.1 and Todd S Horowitz, Ph.D.1

Marcin Czarniecki, M.D.2

Ismail B Turkbey, M.D.3 and Peter L Choyke, M.D.3

1Basic Biobehavioral and Psychological Science Branch, National Cancer Institute

2Medstar Georgetown University Hospital

3Molecular Imaging Branch, National Cancer Institute

Rationale

Numerous cognitive psychology studies have demonstrated that we can determine the global

context of complex real-world scenes (“scene gist”) in a brief glimpse. Similarly, radiologists

can identify the “gist” of a radiograph (i.e., abnormal vs. normal) better than chance in breast,

lung, and prostate images presented for half a second. However, this rapid perceptual gist

processing has only been demonstrated in static two-dimensional images. Standard practice in

radiology is moving to three-dimensional (3D) “volumetric” modalities. In volumetric imaging,

such as multiparametric MRI (mpMRI), used in prostate screening, a single case consists of a

series of image slices through the body that are assembled into a virtual stack. Radiologists can

acquire a 3D representation of organ structures by scrolling through stacks. Can radiologists

extract perceptual gist from this more complex imaging modality?

Methods

We tested 14 radiologists with prostate mpMRI experience on 56 cases, each comprising a stack

of 26 T2-weighted prostate mpMRI slices. Lesions (Gleason scores 6-9) were present in 50% of

cases. In practice, lesions are more prevalent and easier to detect in the peripheral zone (PZ) of

the prostate than in the transition zone (TZ). For lesion-present trials, we used a PZ:TZ ratio of 5:2.

A trial consisted of a single movie of the stack. After each case, participants localized the

cancerous lesion on a prostate sector map, then indicated whether a cancerous lesion was

present, and gave a confidence rating. Presentation duration was varied between groups.

Radiologists were divided into three groups who viewed cases presented at either 48 ms/slice

(20.8 Hz, n = 5), 96 ms/slice (10.4 Hz, n = 5), or 144 ms/slice (6.9 Hz, n = 4).
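The frame rates quoted above follow directly from the slice durations, and the same arithmetic gives the total viewing time per case. The sketch below assumes a single pass through the 26-slice stack with no gap between slices, which the abstract does not state explicitly; the function name `presentation_summary` is introduced here for illustration.

```python
SLICES_PER_STACK = 26  # from the Methods: each case is a stack of 26 slices

def presentation_summary(ms_per_slice, n_slices=SLICES_PER_STACK):
    """Return (frame rate in Hz, total movie duration in seconds)."""
    rate_hz = 1000.0 / ms_per_slice
    total_s = ms_per_slice * n_slices / 1000.0
    return rate_hz, total_s

for ms in (48, 96, 144):
    rate, total = presentation_summary(ms)
    print(f"{ms} ms/slice -> {rate:.1f} Hz, {total:.3f} s per case")
```

Under that assumption, even the slowest condition shows an entire case in under four seconds.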

Results

Radiologists could detect lesions in both zones above chance, with PZ producing higher d’

scores (d’ [95% CI]: PZ = 0.73 [0.46 – 1.00]; TZ = .50 [0.08 – 0.91]). Detection performance did

not vary significantly with slice duration, F(2,11) = 0.74, p = 0.50 (d’ [95% CI]: 48 ms = 0.64

[-0.22 – 1.50]; 96 ms = 0.78 [0.31 – 1.24]; 144 ms = 0.38 [0.06 – 0.71]). Localization accuracy

(chance ≈ 0.08) was 0.40, 0.47, and 0.48, respectively. While the interaction between slice

duration and zone did not reach significance, F(2,11) = 3.53, p = .07, detection peaked at 48 ms

for PZ and 96 ms for TZ (see Figure 1).

Conclusions

Our data indicate that radiologists do develop gist perception for 3D modalities. As expected,

detecting peripheral lesions was easier than transition lesions. Surprisingly, slower presentation

rates did not improve performance. There may be an optimal framerate for processing 3D

anatomical information, depending on anatomical site and/or lesion conspicuity, but further

research is needed.

Figure 1. d’ for lesions located in the PZ & TZ as a function of slice duration. Error bars display

95% confidence intervals.

[Figure 1 plot, titled PZ & TZ d': y-axis d' (-1 to 2); x-axis slice duration (48, 96, 144 ms/slice); series PZ and TZ.]

Perceptual gist in multiparametric imaging

Todd S Horowitz, Ph.D.1 and Melissa Treviño, Ph.D.1

Marcin Czarniecki, M.D.2

Ismail B Turkbey, M.D.3 and Peter L Choyke, M.D.3

1Basic Biobehavioral and Psychological Science Branch, National Cancer Institute

2Medstar Georgetown University Hospital

3Molecular Imaging Branch, National Cancer Institute

Rationale

Humans can extract the “gist” of a visual scene in a fraction of a second, categorizing it as, say, indoor

or outdoor, open or closed. This information can facilitate recognition of objects and guide future eye

movements. Recent studies have demonstrated an analogous ability for radiologists to classify briefly

presented images (e.g., mammograms) as “normal” or “abnormal”. Previously, we have extended that

finding to prostate multiparametric magnetic resonance imaging (mpMRI). MpMRI combines

anatomical information from T2-weighted (T2W) sequences and functional sequences such as

conventional diffusion-weighted imaging (DWI) and the apparent diffusion coefficient (ADC).

Standard workstation formats present these imaging modalities side-by-side. Our goal was to study the

nature of mpMRI gist in different modalities. Which modality generates the strongest gist? Are

anatomical or functional sequences more useful? Do these modalities provide independent gist

information? Furthermore, we tested the hypothesis that experts performed better because they were

more likely to fixate lesions during the brief exposure.

Methods

Experiment 1: Three groups of five radiologists with prostate mpMRI experience were each shown 100

images from a single modality (T2W, DWI, or ADC). The same cases were used across groups. Lesions

(Gleason scores 6-9) were present in 50% of the images. Images were taken from the base, mid, or

apex regions of the prostate. Stimuli were presented for 500 ms, followed by a prostate sector map.

Participants first localized the lesion on the sector map (whether or not they saw a lesion), indicated

whether or not a lesion was present, then provided a confidence rating. In Experiment 2, seven novice

observers with no radiological training and two radiologists with prostate mpMRI experience

performed the same task on a set of 100 T2W images while a Tobii eye tracker recorded their eye

movements.

Results

Experiment 1: All three groups detected lesions better than chance [d' mean (sd): T2W 0.83 (0.51); DWI

0.80 (0.29); ADC 1.16 (0.31)]. Partial correlations between modalities, holding Gleason score

constant, were moderate and significant [r: T2W x DWI .23; T2W x ADC .35; DWI x ADC .37].

Experiment 2: One novice was excluded due to poor-quality eye tracking data. Radiologists again

demonstrated above chance lesion detection [d’ mean (sd) 1.30 (0.47)], while novices did not [d’ mean

(sd) 0.01 (0.40)]. As expected, radiologists were more likely to fixate the lesion on target- (i.e., lesion-)

present trials [radiologists: 60%; novices: 26%]. However, this advantage was not driving their superior

lesion detection. For both groups, lesion detection was no more likely when fixating the lesion than

when failing to fixate (chi-sq, p: radiologists 0.92, .33; novices 0.06, .80).
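The chi-square statistics and p-values reported above can be related with a short calculation. The sketch assumes each test was run on a 2×2 table (lesion fixated or not, by lesion detected or not), which gives one degree of freedom; the abstract does not state the degrees of freedom, and the function name `chi2_p_value_df1` is introduced here for illustration.

```python
import math

def chi2_p_value_df1(chi2_stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For df = 1 the survival function reduces to erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2_stat / 2.0))

p_radiologists = chi2_p_value_df1(0.92)  # statistic reported for radiologists
p_novices = chi2_p_value_df1(0.06)       # statistic reported for novices
```

With the rounded statistics from the Results, the function returns p-values close to the reported .33 and .80. Small differences are expected because the published statistics are rounded to two decimals.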

Conclusions

These results indicate that the ADC modality generates the strongest gist signal, but both anatomical

and functional sequences can contribute to mpMRI gist. The moderate correlation across modalities

suggests redundancy. Importantly, performance was unaffected by whether or not observers fixated the

lesion. Radiologists make more informed eye movements even in brief glimpses, but eye movements

do not drive gist perception. Future studies should explore how gist perception drives eye movements

across imaging modalities.

The Sum or Parts: Exploring Radiologist

Reliance on Peripheral Vision and

Holistic Processing through Gaze-

Contingent Viewing

Grace L. Nicora B.A.¹, Victoria Wilson¹, Dustin Stokes PhD², Jeanine Stefanucci PhD¹, &

Trafton Drew PhD¹

Department of Psychology¹ and Department of Philosophy² at the University of Utah

Rationale

The holistic processing theory of expertise posits that experts can quickly take in global

information from their stimuli and then use that information to guide their search for a target.

This perceptual processing advantage is thought to underlie the superior performance associated

with expertise. In radiology, this theory posits that when presented with a chest x-ray, expert

radiologists process the Gestalt (or whole) of the image within as little as 100 milliseconds. This

initial Gestalt impression is thought to quickly guide attention to regions that deviate from

normal, thereby enabling fast and accurate performance. Importantly, the holistic processing

theory predicts that experts rely on their peripheral vision to quickly extract the Gestalt

impression of a case, and that this ability does not generalize to tasks outside of their expertise.

Methods

We tested this theory with radiologists through the use of a gaze-contingent viewing (GCV)

window. A GCV window allows us to restrict the amount of peripheral information available to

the viewer. In this design, radiologists were only able to see a circular region (5° of visual angle)

where they were actively fixating. All other parts of the image were occluded. In order to see

more of the image, radiologists had to move their eyes to reveal different parts of the image.

Radiologists searched two types of images: one was a chest x-ray and the other a non-

radiographic image. To test the reliance on peripheral vision, radiologists were exposed to a

normal viewing condition and the GCV condition. Holistic processing theory predicts these

experts will be impaired in the presence of the GCV window to a greater extent for the chest x-

ray image compared to the control image.
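The gaze-contingent window can be illustrated with a small geometric sketch. It is not the authors' implementation: the viewing distance and pixel pitch are hypothetical, the 5° figure is treated as the window diameter (the abstract does not say whether it is a diameter or a radius), and the function names are introduced here.

```python
import math

def degrees_to_pixels(angle_deg, viewing_distance_mm, pixel_pitch_mm):
    """Size on screen (in pixels) subtended by a visual angle centred on the line of sight."""
    size_mm = 2.0 * viewing_distance_mm * math.tan(math.radians(angle_deg) / 2.0)
    return size_mm / pixel_pitch_mm

def is_visible(px, py, gaze_x, gaze_y, window_diameter_px):
    """True if pixel (px, py) lies inside the circular window around the gaze point."""
    radius = window_diameter_px / 2.0
    return (px - gaze_x) ** 2 + (py - gaze_y) ** 2 <= radius ** 2

# Hypothetical setup: 600 mm viewing distance, 0.25 mm pixel pitch.
diameter_px = degrees_to_pixels(5.0, viewing_distance_mm=600.0, pixel_pitch_mm=0.25)
```

On each eye-tracker sample, a display loop would re-evaluate `is_visible` around the new gaze position and occlude every pixel for which it returns False, so peripheral information never reaches the viewer.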

Results

As expected, GCV led to increased viewing time in both tasks. Critically, this cost was larger for

the radiology task than the control task. We also observed an interaction in saccadic amplitude.

GCV led to shorter saccades for both conditions, but the effect was much larger when viewing

chest radiographs.

Conclusions

Our results support the holistic processing theory of expertise in radiology. As predicted by the

holistic processing theory, experts were more impaired in their domain of exper
