Automated Detection of Simulated Motion Blur
in Digital Mammograms
Nada Kamona, B.S. and Murray Loew, Ph.D.
Department of Biomedical Engineering, George Washington University, Washington, D.C.
Rationale
Motion blur is a known phenomenon in full-field digital mammography that arises during image acquisition. It
has been reported to reduce lesion detection performance and mask small microcalcifications, resulting in
failure to detect smaller abnormalities until they reach more advanced stages. It is estimated that 20% of
screening mammograms show elements of blur. Motion blur has been found to be due mainly to paddle motion
(up to 1.5 mm vertically) during the clamping phase of the mammography exam. We propose using machine
learning algorithms to automatically detect motion blur, which could support the clinical decision-making
process during the mammography exam by allowing for an immediate retake, thereby preventing unnecessary
expense, time, and patient anxiety.
Methods
To mimic blur seen in mammograms, we simulated it mathematically. The blur point-spread function mask is
generated by displacing an individual pixel by a random vector (within the range of the blur effect) and the
pixel contribution to the overall image is then sampled on a regular pixel grid using subpixel linear
interpolation. This randomly-generated motion trajectory is constrained by several factors; we examined the
effects of variations in tissue elasticity, imaging exposure time, and size of blur effect (motion boundary in
millimeters). The blur mask is convolved with a mammogram to create blur. Three motion blur magnitudes
(0.5, 1.0, and 1.5 mm) were simulated on 68 mammograms (INbreast Database, normal cases, CC and MLO
views). Blur was quantified using 17 blur operators for each mammogram and at each blur level (272 images
total). Machine learning classifiers, including Linear Support Vector Machine (SVM) and Subspace
Discriminant Ensemble (SDE), were trained to distinguish three levels of blurred from unblurred mammograms,
using four-way classification.
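The blur simulation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the random-walk trajectory model, the number of trajectory samples, and the 0.07 mm pixel pitch are all assumptions introduced here, and the tissue elasticity and exposure time constraints mentioned in the text are not modeled.

```python
import numpy as np

def random_motion_psf(blur_mm, pixel_mm=0.07, n_steps=200, seed=0):
    """Return a normalized blur kernel traced by a bounded random trajectory.

    blur_mm  : motion boundary in millimeters (0.5, 1.0 or 1.5 in the study)
    pixel_mm : assumed detector pixel pitch; not given in the abstract
    """
    rng = np.random.default_rng(seed)
    max_shift_px = blur_mm / pixel_mm
    # Random walk of displacement vectors, centered and rescaled so that
    # no point leaves the motion boundary (an assumed trajectory model).
    path = np.cumsum(rng.normal(size=(n_steps, 2)), axis=0)
    path -= path.mean(axis=0)
    path *= max_shift_px / np.abs(path).max()
    size = 2 * int(np.ceil(max_shift_px)) + 3
    center = size // 2
    psf = np.zeros((size, size))
    for dy, dx in path:
        y, x = center + dy, center + dx
        y0, x0 = int(np.floor(y)), int(np.floor(x))
        fy, fx = y - y0, x - x0
        # Subpixel linear interpolation: split each sample over 4 neighbors.
        psf[y0, x0] += (1 - fy) * (1 - fx)
        psf[y0, x0 + 1] += (1 - fy) * fx
        psf[y0 + 1, x0] += fy * (1 - fx)
        psf[y0 + 1, x0 + 1] += fy * fx
    return psf / psf.sum()

def blur(image, psf):
    """Convolve a 2D image with the kernel, replicating edge pixels."""
    k = psf.shape[0] // 2
    h, w = image.shape
    padded = np.pad(image.astype(float), k, mode="edge")
    out = np.zeros((h, w))
    n = psf.shape[0]
    for i in range(n):
        for j in range(n):
            # Flipped kernel indices give a true convolution.
            out += psf[n - 1 - i, n - 1 - j] * padded[i:i + h, j:j + w]
    return out
```

Calling `blur(mammogram, random_motion_psf(0.5))` on a 2D array would produce one simulated blur level; because the kernel sums to one, mean image intensity is preserved.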
Results
The average accuracy for classifying unblurred and blurred mammograms at three levels of magnitude was
75.40% and 74.60% for Linear SVM and SDE respectively. The true positive rate was highest for classifying
mammograms with no simulated blur, reaching 99% for both classifiers with a false negative rate of 1%. For
Linear SVM, the true-positive rates for blur levels 0.5 mm, 1.0 mm, and 1.5 mm are 75%, 57%, and 71%
respectively, while the false-negative rates are 25%, 43%, and 29% respectively. For SDE, the true-positive
rates are 72%, 51%, and 76% and the false-negative rates are 28%, 49%, and 24% for blur levels 0.5 mm, 1.0
mm, and 1.5 mm respectively. Classifiers trained to distinguish mammograms with no blur from those with
the lowest simulated blur level (0.5 mm) achieved accuracies of 98.5% and 97.8% for the Linear SVM and SDE
respectively.
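As a consistency check on the four-way results: with the same number of images in each class (68 per class here), overall accuracy equals the mean of the per-class true-positive rates, and each false-negative rate is one minus the true-positive rate. The short sketch below applies this to the rounded rates quoted above; the dictionary and function names are illustrative only, and small differences from the reported 75.40% and 74.60% are expected from rounding.

```python
# Per-class true-positive rates as quoted in the Results (rounded values).
svm_tpr = {"no blur": 0.99, "0.5 mm": 0.75, "1.0 mm": 0.57, "1.5 mm": 0.71}
sde_tpr = {"no blur": 0.99, "0.5 mm": 0.72, "1.0 mm": 0.51, "1.5 mm": 0.76}

def balanced_accuracy(tpr_by_class):
    """Mean per-class TPR; equals overall accuracy when classes are equal in size."""
    return sum(tpr_by_class.values()) / len(tpr_by_class)

def false_negative_rates(tpr_by_class):
    """FNR for each class is the complement of its TPR."""
    return {label: 1.0 - tpr for label, tpr in tpr_by_class.items()}

svm_accuracy = balanced_accuracy(svm_tpr)
sde_accuracy = balanced_accuracy(sde_tpr)
```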
Conclusion
Our preliminary results show the potential to detect simulated blur automatically using machine learning
classifiers and blur operators. Although limited work has been done to quantify the effects of motion blur on
radiologists’ performance, there is evidence that although motion blur might not be detected visually by a
human observer, it can nevertheless affect diagnostic performance. We are now using larger mammographic
datasets to train convolutional neural networks and validate the developed blur model.
-
Assessment of BREAST as a learning tool for
breast cancer detection for trainees using digital
mammography
A Ganesan MSc1, PC Brennan PhD1,2, K Tapia MSc2, C Mello-Thoms PhD 1,3
1Medical Image Optimization and Perception research Group (MIOPeG), Faculty of Health Sciences, University
of Sydney, NSW, Australia. 2BreastScreen Reader Assessment Strategy (BREAST), University of Sydney, NSW, Australia.
3University of Iowa, Department of Radiology, IA, USA.
Rationale
Mammography is the primary screening tool for early detection of breast cancer. However, about 30% of cancers
are missed. Previous research suggested that the level of readers’ experience is one of the most important factors
affecting their accuracy in detecting lesions. The Breast Screen Reader Assessment Strategy (BREAST) is an online
testing platform that enables assessment of clinicians’ performance including radiologists, trainees and breast
physicians in detecting breast cancer using digital mammograms. A recent study showed that engaging with
test-sets significantly improved radiologists’ breast cancer detection performance. In this study, we aim to assess the
impact of BREAST as a training tool for improving trainees’ breast cancer detection performance.
Methods
This study was conducted using BREAST, an online screen reading test which allows readers to read test sets, report
the cancerous cases, mark their location and rate them on a scoring scale of 1-5 (1-“normal”, 2-“benign”,
3-“equivocal”, 4-“suspicious”, and 5-“malignant”). Five test-sets including Hobart, Sydney, Darwin, Melbourne and
Gold Coast and twenty-three trainees, who completed at least three of the test-sets in chronological order of release,
were included in this study. To demonstrate the level of improvement, the test sets were grouped in three (G1, G2
and G3) based on the order of release and readers who completed the three test-sets from each group were arranged
in chronological order of completion and named as TS1, TS2 and TS3 respectively. Performance measures including
sensitivity, specificity, location sensitivity, area under the receiver operating characteristic curve (ROC AUC) and
jackknife alternative free-response receiver operating characteristic (JAFROC) figure-of-merit of every test-set
were compared between each pair of test sets from each group.
Results
The results showed significant improvement in specificity between some of the test sets for one group of trainees.
No other significant improvement in trainees’ performance was shown.
Conclusion
It is interesting to note that whilst BREAST has been a very effective tool for improving radiologists’ performance,
the improvements are not so evident with registrars. The most likely explanation is that the cases used for BREAST
are highly challenging ones which may be too difficult for educational purposes for more junior doctors. The need
to tailor test sets specific to the level of training and experience is emphasized.
-
Human and model observer study for task
detection in digital breast tomosynthesis
Seungyeon Choi1), Sunghoon Choi2), Donghoon Lee1), Young-Wook Choi3), and Hee-Joung Kim1),2)*
1) Department of Radiation Convergence Engineering, Yonsei University, Wonju, Korea
2) Department of Radiological Science, Yonsei University, Wonju, Korea
3) Pioneering Medical-Physics Research Center, Korea Electrotechnology Research Institute
(KERI), Ansan 15588, Republic of Korea
Rationale
Task-based assessment of image quality through theoretical observer models has recently attracted
attention in medical imaging. Observer models that suitably match human
observer performance under various imaging conditions are considered key to virtual
clinical studies. The current work focuses on experimental studies using a prototype
digital breast tomosynthesis (DBT) system to compare the task-based detectability
index with human observer performance across various tomosynthesis angular range imaging
protocols.
Methods
We used the prototype DBT system developed by the Korea Electrotechnology Research Institute with
different angular range setups from ±10.5° to ±24.5° while using the same 15 projection images.
Human observer performance was measured in four-alternative forced-choice (4AFC) tests for detection
of different tasks including spheroidal masses and microcalcification clusters. The task-based
detectability index (d’) was calculated with the non-prewhitening matched filter observer by analyzing
the task function, local spatial resolution and local noise of spheroidal masses. The proportion of correctly
detected signals (Pcorr) in the 4AFC tests was then compared with d’.
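One common way to put a detectability index and a forced-choice score on the same footing is the standard signal detection relation for an M-alternative forced-choice task with equal-variance Gaussian responses, Pc = ∫ φ(x − d') Φ(x)^(M−1) dx. The abstract does not say how the two quantities were compared, so the sketch below is only an illustration of that textbook relation; the function name and the integration grid are assumptions.

```python
from math import erf, pi, sqrt

import numpy as np

def percent_correct_mafc(d_prime, m=4):
    """Proportion correct in an m-AFC task for a given detectability index.

    Assumes independent, unit-variance Gaussian responses for the signal
    location (mean d_prime) and the m - 1 noise locations (mean 0).
    """
    x = d_prime + np.linspace(-10.0, 10.0, 20001)
    dx = x[1] - x[0]
    signal_pdf = np.exp(-0.5 * (x - d_prime) ** 2) / sqrt(2.0 * pi)
    noise_cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])
    # Probability that the signal response exceeds all m - 1 noise responses.
    return float(np.sum(signal_pdf * noise_cdf ** (m - 1)) * dx)
```

At d' = 0 the function returns chance performance (1/m), and it increases monotonically with d', which is the property that lets trends in d' be compared with trends in the measured proportion correct.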
Results
In the human observer study, the average Pcorr across seven observers was 0.87, with individual Pcorr values
ranging from 0.71 to 0.92. Pcorr decreased as the angular range increased from ±10.5° to ±24.5°
for the different task sizes. Moreover, the theoretical model
observer showed a similar trend to the human observers’ Pcorr results.
Conclusions
In this study, we evaluated task-based human and model observer performance by
comparing the detectability index and Pcorr among several tomosynthesis angular range setups.
The model observer showed a similar trend to the human observer results in our
prototype DBT system. Establishing the correlation between theoretical and measured performance is necessary
for a better description of task-based model observer performance in future studies.
-
Sneak Peak: Are Radiologist Search Patterns
Altered by a 2D Preview Before a Breast
Tomosynthesis Image?
Nicholas M. D’Ardenne, MBBS1; Robert M. Nishikawa, PhD1; Margarita L. Zuley, MD 1,2,
Chia-Chien Wu, PhD3; Jeremy M. Wolfe, PhD3.
1. Department of Radiology, University of Pittsburgh, Pittsburgh, PA.
2. University of Pittsburgh Medical Center, Magee Womens Hospital, Pittsburgh, PA.
3. Visual Attention Lab, Harvard University, Cambridge, MA.
Rationale
Digital Breast Tomosynthesis (DBT) is beginning to be used more frequently alongside Full Field
Digital Mammography (FFDM) in routine breast screening. One drawback of this newer
technology is the longer reading times. We aim to investigate if search patterns, duration of reading
and accuracy of diagnosis differ if radiologists are given a 2D preview before viewing 3D
tomosynthesis cases.
Methods
Readers were instructed to search for lesions as they would under normal clinical conditions and
were informed that this would be an enriched study (10 positive cases out of 20). Eye tracking
used an SMI RED250mobile Eye Tracker sampling at 250 Hz. Calibration aimed for tracking error
below 0.5 deg. The images were read on an EIZO RadiForce GS520 5MP (2048 x 2560 native
resolution) monitor. There were three viewing conditions: 1) FFDM images alone, 2) DBT alone,
3) DBT with an FFDM preview. A single view was presented for each case. Cases were read over 3
sessions with a washout period of at least one week. Accuracy of diagnosis, time spent on the study
and search patterns were recorded for each case.
Results
Preliminary results from 3 (out of 12) readers (Table 1) have been reviewed. Two were experienced
readers (with 20 and 30 years of experience); the third had 3 years of experience. These preliminary
results indicate that there is a decrease in the time spent viewing DBT when a 2D preview is
provided, from a mean of 63.7 seconds (range 9.6-217.3) without preview to 47.0 seconds (range 8.1-134.4)
with the preview. The mean sensitivity and specificity of the observers’ findings are
essentially unchanged despite this decrease in time taken. In the eye tracking data, a somewhat
smaller percentage of breast area is covered when a reader has a preview (32%) compared to when
they do not have a preview (37%), assuming a 5 degree window around each fixation.
Table 1
       Without 2D Preview                          With 2D Preview                             Change Between Viewing Conditions (Δ)
Subj.  Time, sec (range)  Sens.  Spec.  Area (%)   Time, sec (range)  Sens.  Spec.  Area (%)   Time (sec)  Sens.  Spec.  Area (%)
1      24.9 (9.6-44.9)    0.50   0.70   28         19.5 (9.7-47.1)    0.50   0.70   25         -5.4        0      0      -11
2      58.2 (17.8-127)    0.80   0.90   34         58.9 (30.2-133)    0.80   0.80   35         0.7         0      -0.1   3
3      104 (45-217)       0.90   0.50   50         56.4 (8.1-134)     0.80   0.70   36         -47.3       -0.1   0.2    -28
Mean   63.7 (9.6-217)     0.73   0.70   37         47.0 (8.1-134)     0.70   0.73   32         -16.7       -0.03  0.03   -14
Time: mean view time with range; Sens.: sensitivity; Spec.: specificity; Area: area of breast viewed.
Conclusions
Our preliminary results suggest there is a decrease in the time taken to view DBT cases when a 2D
preview is supplied. As there is a relative decrease of 14% in the breast area reviewed by readers with
a 2D preview, it may allow readers to focus search on a smaller fraction of the image without
sacrificing accuracy. We will present results from all 12 readers at the meeting.
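The change values for area of breast viewed in Table 1 appear to be relative changes (for example, a drop from 37% to 32% is a relative decrease of about 14%), whereas the view-time change is a simple difference in seconds. That reading is inferred from the numbers rather than stated in the abstract; the sketch below, with illustrative names, reproduces the tabulated values under that assumption.

```python
def relative_change_pct(without_preview, with_preview):
    """Relative change in percent, using the no-preview value as baseline."""
    return 100.0 * (with_preview - without_preview) / without_preview

# Area of breast viewed (%), (without preview, with preview), from Table 1.
area_viewed = {"1": (28, 25), "2": (34, 35), "3": (50, 36), "Mean": (37, 32)}

area_change = {
    subject: round(relative_change_pct(before, after))
    for subject, (before, after) in area_viewed.items()
}

# Absolute change in mean view time (seconds), with preview minus without.
mean_time_change = round(47.0 - 63.7, 1)
```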
-
Identifying Sources for Improving Breast
Image Quality within the setting of the
MQSA EQUIP
Lonie R Salkowski MD MS PhD1,2, Jess Harried RT3
University of Wisconsin School of Medicine & Public Health, Department of Radiology1
University of Wisconsin School of Medicine & Public Health, Department of Medical Physics2
University of Wisconsin Health Sciences3
Rationale
In January 2017, the Enhancing Quality Using the Inspection Program (EQUIP) was added to the
FDA/MQSA breast imaging program to ensure image quality review and implementation of
corrective processes. Breast image quality is the responsibility of both the technologists and
radiologists. Improper image quality can result in potentially missed breast cancers. Prior
research has suggested that positioning is a major reason for technical recalls. Breast imaging
fellowship trained radiologists spend a full year learning all elements of breast imaging including
assessment of image quality. General radiologists receive training about image quality in their
three months of required breast imaging during residency. Based on training practices it is
reasonable to expect differences in the type and number of technical recalls from fellowship
trained and general radiologists who practice breast imaging.
Methods
This HIPAA-compliant study was exempt from IRB review. In consecutive screening
mammograms (January 2015 through December 2018), prospectively recorded technical recalls
were collected from a hybrid breast imaging service. The technical recalls were compared for
imaging modality (FFDM or DBT), images requested, and indication(s) for technical recall
(motion, positioning, technical/artifact). Chi-squared tests evaluated statistical significance
between proportions.
Results
During the study interval, 58,448 screening mammograms were performed with 141 technical
recalls requested by the radiologists (0.24%). During the 1013 clinical days, 32.3% had coverage
with a breast fellowship trained radiologist. The general radiologists made 33 technical recalls,
and fellowship trained radiologists made 108 recalls. Comparing the images requested for
technical recall, general radiologists (28.3%) requested significantly more Left CC views than
fellowship trained (11.8%) (p=0.0059). The differences in requests for Right CC, Right MLO
and Left MLO views were not significant, although there was a trend for fellowship trained
radiologists to recall more Right MLO and Left MLO views.
The general radiologists had 38 reasons for recalling 33 cases, compared to 150 reasons for 108
recalls for the fellowship trained radiologists. There were significant differences in three groups
of reasons (motion, positioning, technical/artifact) for technical recall between fellowship trained
and general radiologists. General radiologists (36.8%) requested significantly more technical
recalls for motion compared to fellowship trained radiologists (14.0%)(p=0.0013). Fellowship
trained radiologists (68.0%) requested significantly more recalls for errors in positioning
compared to general radiologists (39.5%) (p=0.0012). There was no significant difference
(p=0.4279) in fellowship trained and general radiologists for artifact based technical recalls (18%
and 23.7% respectively).
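The comparisons of recall reasons can be illustrated with a chi-squared test on a 2x2 table. The counts below are not given in the abstract; they are reconstructed here from the reported percentages and the totals of 38 and 150 reasons, and the test is run without a continuity correction, which is also an assumption. Under those assumptions the p-values come out close to the reported ones.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

# Reconstructed counts: (reasons of this type, all other reasons).
general = {"motion": 14, "positioning": 15, "artifact": 9}       # of 38 reasons
fellowship = {"motion": 21, "positioning": 102, "artifact": 27}  # of 150 reasons

results = {
    reason: chi2_2x2(general[reason], 38 - general[reason],
                     fellowship[reason], 150 - fellowship[reason])
    for reason in general
}
```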
Conclusions
The EQUIP program requires a mechanism for image quality improvement and
feedback. Fellowship trained breast imagers have more concentrated and longer training in
image quality than general radiologists. Additional training for both general and fellowship
trained radiologists in identifying image quality, with attention to positioning errors, will
enhance a breast imaging program and provide improved patient care.
-
Relationship between Obuchowski-
Rockette and Gallas U-statistic methods
for analyzing multi-reader diagnostic
imaging data
Stephen L. Hillis, PhD
Departments of Radiology & Biostatistics, University of Iowa
Rationale
The Obuchowski-Rockette (OR) and Gallas U-statistic (U-stat) methods have been the two most
frequently used methods for analyzing multireader multicase (MRMC) diagnostic imaging data
that allow conclusions to generalize to both the reader and case populations. The OR method is
the more general method because it can be used with any reader-performance measure, whereas
the U-stat method is limited to a U-statistic outcome, such as the empirical (or trapezoidal) AUC
statistic. On the other hand, advantages of the U-stat method are that it provides exact
expressions for the outcome variance, provides unbiased variance estimates, and makes it easy
to size future studies having a different abnormal-to-normal case ratio than was used in a pilot
study. However, previously it has not been clear if there is a direct link between the two
methods. In this talk I discuss a particular version of the OR method that produces the same test
statistic as the U-stats method.
Methods
I discuss a new way to estimate the error covariances when using the OR model which utilizes
the U-statistic approach.
Results
I show analytically that this version of the OR method produces the same test statistic as the U-
stats method.
Conclusions
Showing that a U-stats analysis can be performed using the OR method is useful in several ways:
(1) Previously the U-stats method was limited to comparison of two modalities. Now
the U-stats method can be used for testing for equivalence for several modalities, because the OR
method allows for this. (2) The equivalence of the statistics establishes that there is now an
unbiased variance version of the OR method available for U-statistic outcomes. (3) For U-
statistic outcomes, it is now easy to use the OR method to compute sample size for studies
having a different abnormal-to-normal case ratio than was used in a pilot study. (4) Negative
variances using the U-stat method can be avoided by using the well-tested OR approach for
computing degrees of freedom and constraining the variance to be positive. (5) If researchers
want to analyze a U-statistic outcome, they no longer have to be concerned with the question of
which method is better.
-
The strength of the gist of the abnormal in the
unilateral and bilateral mammograms
Ziba Gandomkar*(a), Ernest U. Ekpo(a), Sarah J. Lewis(a), Karla K. Evans(b), Kriscia Tapia(a), Tong Li(a), Seyedamir
Tavakoli Taba(a), Jeremy M. Wolfe(c), Patrick C. Brennan(a)
(a) Medical Imaging Sciences, Faculty of Health Sciences, University of Sydney, Sydney, NSW, Australia;
BreastScreen Reader Assessment Strategy (BREAST), University of Sydney, Sydney, NSW, Australia.
(b) Department of Psychology, University of York, Heslington, York, UK.
(c) Visual Attention Lab, Harvard Medical School, Cambridge, MA, USA.
Rationale
Experts can perceive the gist of the abnormal in the negative prior unilateral mammograms of women who were subsequently
diagnosed with breast cancer. Here, we compared the strength of the gist from unilateral and bilateral mammograms.
Methods
Seventeen radiologists viewed 60 cases in two different experiments (GistUnilateral and GistBilateral). In GistUnilateral, 60
unilateral craniocaudal mammograms were presented in a randomly generated sequence for a half-second to the
radiologists, who were asked to provide an abnormality probability for each case on a scale from 0 (confident normal)
to 100 (confident abnormal). In GistBilateral, we presented bilateral mammograms of the same cases using a similar
experimental protocol. Readers were randomly assigned to two groups: the first did the unilateral experiment first, while
the second group did the bilateral experiment first. Four categories of mammograms (15 cases per category) were
included: 1) Cancer cases, which contained biopsy-proven malignancies; 2) Normal cases, which remained normal for at
least the next two years; 3) Prior_Vis cases, which contained retrospectively visible non-actionable cancer signs; 4)
Prior_Invis cases, which did not contain visible cancer signs. Mammograms from the last two groups were from women
who subsequently developed biopsy-proven malignancies. For each radiologist and each category, the Pearson
correlation between the unilateral and bilateral gist responses was calculated. In each experiment, three pair-wise
classifications, i.e. Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal were analysed. A paired, two-sided
Wilcoxon Signed Rank test was used to investigate whether the values of area under receiver operating characteristic
curves (AUC) were at an above-chance (AUC=0.5) level. The same test was also used to show whether the AUC values
from two experiments differed significantly for each pair-wise classification. For each radiologist and each case, we also
calculated the average of the two gist responses recorded in the two experiments and produced GistAVE, i.e.
½(GistUnilateral+GistBilateral).
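The AUC comparison and the averaged score can be sketched for a single reader and one pair-wise classification as follows. The empirical AUC is computed as the proportion of abnormal/normal case pairs ranked correctly (ties count one half). The ratings are made-up placeholder values chosen only to make the example run, not study data, and the variable names are illustrative.

```python
def empirical_auc(abnormal_scores, normal_scores):
    """Empirical AUC: fraction of (abnormal, normal) pairs ranked correctly."""
    total = 0.0
    for a in abnormal_scores:
        for n in normal_scores:
            if a > n:
                total += 1.0
            elif a == n:
                total += 0.5
    return total / (len(abnormal_scores) * len(normal_scores))

def gist_average(unilateral, bilateral):
    """Case-wise mean of the two gist responses (GistAVE in the text)."""
    return [0.5 * (u + b) for u, b in zip(unilateral, bilateral)]

# Hypothetical 0-100 ratings from one reader for three cases per category.
uni_cancer, uni_normal = [70, 40, 90], [30, 50, 20]
bi_cancer, bi_normal = [60, 55, 40], [45, 20, 50]

auc_unilateral = empirical_auc(uni_cancer, uni_normal)
auc_bilateral = empirical_auc(bi_cancer, bi_normal)
auc_average = empirical_auc(gist_average(uni_cancer, bi_cancer),
                            gist_average(uni_normal, bi_normal))
```

In the study this calculation would be repeated per radiologist, with the resulting AUC values then compared across experiments using the Wilcoxon signed rank test.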
Results
The average correlation coefficients across 17 readers for Cancer, Normal, Prior_Vis, Prior_Invis, and all cases were
0.17 (CI=0.03-0.31), 0.26 (CI=0.09-0.43), 0.30 (CI=0.12-0.49), 0.35 (CI=0.21-0.49), and 0.35 (CI=0.25-0.44),
respectively. The order of median AUCs in Cancer/Normal and Prior_Vis/Normal classifications from the highest to the
lowest was GistAVE>GistUnilateral>GistBilateral. All differences except the difference for GistAVE and GistUnilateral in
Prior_Vis/Normal classification were significant. In Prior_Invis/Normal classification, the order was
GistAVE>GistBilateral>GistUnilateral. None of the differences in the AUC values for Prior_Invis/Normal classification were
significant. On average, the AUCs of Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal classifications based on
GistUnilateral respectively dropped by 8%±6%, 10%±8%, and 1%±8% in the bilateral experiment while these AUCs
increased by 5%±3%, 2%±4%, and 4%±6% after averaging two signals. On average, the AUCs of Cancer/Normal,
Prior_Vis/Normal, Prior_Invis/Normal classifications based on GistAVE were 82%±4%, 74%±3%, and 67%±5%.
Conclusions
There is a weak association between the gist signals from unilateral and bilateral mammograms. The signal was stronger in
the unilateral experiment. When the two signals were averaged, the AUCs increased. The improvement could be a result
of cancelling out random noise by averaging two values. Further investigation of intra-reader variability, and of
the AUC when a reader’s unilateral gist responses are averaged over multiple experiments, is required.
-
Perceptual Training –
Learning versus Attentional Shift
Soham Banerjee, MD [1]; Megan Mills, MD [1]; Trafton Drew, PhD [2];
William F. Auffermann, MD/PhD [1*]
[1] Department of Radiology and Imaging Sciences, University of Utah Health, Salt Lake City,
UT, USA; [2] Department of Psychology, University of Utah, Salt Lake City, UT, USA;
[*] Corresponding Author
Rationale: Perceptual training (PT) has been shown to improve healthcare trainees’ ability to identify
abnormalities on chest radiography (CXR). Specifically, recent studies have examined the
effects of search pattern training, and showed improved performance with training. However, it
was not clear if the improved performance was due to learning, or due to an attentional shift
resulting from cueing related to the training. The objective of this study is to determine
whether improved subject performance on CXR evaluation after PT is due to learning or
attentional shift.
Methods: A perceptual training experiment with 41 physician assistant trainees was performed. All
subjects voluntarily participated and provided informed consent. Subjects evaluated CXRs for
appropriate central venous catheter (CVC) positioning and other imaging related tasks before and
after educational interventions. For the intervention, the control group received an attentional
control task, and the experimental group received perceptual training in the form of search
pattern training for CVC characterization.
Many of the subjects' tasks were similar to prior studies and included: 1) Marking the tip of the
catheter, 2) Indicating their confidence in catheter tip localization, 3) Indicating whether the
catheter was adequately positioned or malpositioned.
In addition, subjects were asked to rate whether the cardiac silhouette was normal or abnormally
enlarged using a 5-point scale. Information on how to perform cardiac evaluation was given
only at the beginning of the study with the study’s introductory materials.
Subject ability to characterize the adequacy of catheter positioning (Line-Safe) and the heart size
(Heart-Size) were quantified using receiver operating characteristic (ROC) analysis. Subject
ability to localize the catheter tip (Line-Loc) was quantified using localization ROC (LROC)
analysis. The figure of merit for performance was the area under the curve (AUC).
Results: The difference in AUC for subject performance before and after the educational intervention and
the corresponding p-values are given in the table below.
              Line-Loc          Line-Safe         Heart-Size
              ΔAUC   P-Value    ΔAUC   P-Value    ΔAUC   P-Value
Control       -0.11  0.88       0.06   0.01       0.04   0.01
Experimental  0.30
-
RadSimP - A Custom Software Solution
for Perceptual Training Compared with
Current Perceptual Software
Soham Banerjee, MD [1]; Megan Mills, MD [1]; Trafton Drew, PhD [2];
William F. Auffermann, MD/PhD [1*]
[1] Department of Radiology and Imaging Sciences, University of Utah Health, Salt Lake City,
UT, USA; [2] Department of Psychology, University of Utah, Salt Lake City, UT, USA;
[*] Corresponding Author
Rationale: Recent studies have shown the utility of perceptual training (PT) for teaching healthcare
trainees good perceptual habits when evaluating medical images. Prior studies were performed
using software designed for perceptual observer studies. As most software packages for image
perception are geared towards research, they were not optimized for perceptual training and
assessment. To date, there have been no software packages specifically designed for perceptual
training. The goal of this study is to determine if perceptual training using our custom software
solution, RadSimP, results in improved performance relative to training using current
perceptual research software.
Methods: PT for central venous catheter (CVC) positioning was performed using a counterbalanced
design. Subjects were shown several sets of chest radiographs (CXRs) with CVCs that were
either adequately positioned or malpositioned. Subjects were asked to: mark the tip of the catheters, rate
their confidence in catheter tip localization, and state whether or not the catheters were
adequately positioned.
The same study was conducted twice using two different PT software packages. Study-A used
ViewDEX (https://sas.vgregion.se/en/for-dig-som-ar/vardgivare/viewdex/), a software package
for perceptual research. Study-B used RadSimP, our custom perceptual training and radiology
workstation simulator software package, written in Python.
All subjects voluntarily participated and provided informed consent. For Study-A, 14 physician
assistant students participated. For Study-B, 41 physician assistant students participated.
Training and assessment were done at individual computer workstations in an educational
computer classroom.
During Study-A, the trainees had to manually switch between folders on the desktop to access
the appropriate educational materials and had to manually enter information to get the correct
set of cases for assessment. For Study-B, the RadSimP program seamlessly integrated subject
consent, training, practice, and assessment in a simulated radiology workstation environment.
In addition, RadSimP was loaded onto the classroom’s network storage drive, such that all
subjects could run it simultaneously, and the results were automatically collected in a central
location.
A survey was given to subjects after the completion of the study to assess the subjects’
impressions of perceptual training and the RadSimP software package. The survey asked if
subjects felt the search pattern training and simulator environment were helpful for learning
about radiology. Responses were collected using a 5-point Likert response format (where 5
indicates strongly agree).
Results: Using both training paradigms, the subjects in the experimental group showed a statistically
significant improvement in their ability to characterize a catheter as acceptable versus
malpositioned. The differences in areas under the localization receiver operating characteristic
curves were 0.07 using the conventional software and 0.1 using RadSimP, p-values of 0.02 and
-
Meaningful Feedback in Breast Imaging
Simulation Assessment
Lonie R Salkowski, MD MS PhD1,2, Mai A Elezaby MD1, Elizabeth A Krupinski, PhD3
University of Wisconsin School of Medicine, Department of Radiology1
University of Wisconsin School of Medicine, Department of Medical Physics2
Emory University, Department of Radiology & Imaging Sciences3
Rationale
There is a lack of objective assessment of the quality of interpretive skills in radiology residency
training. This is especially pertinent in breast imaging where there is no independent
interpretation of exams during residency and the majority of residents will not pursue a breast
imaging fellowship. Simulation is a validated technique which facilitates independent
interpretation of thoughtfully developed clinical cases with sequential exposure during residency
training. The format of meaningful feedback should provide both formative and summative
information that is beneficial and not punitive for residents.
Methods
We developed a breast imaging simulation to provide serial feedback for residents over the four
years of their radiology residency training. Users will be provided feedback in several formats.
First, a modified medical audit (recall rate, cancer detection rate, sensitivity, specificity, PPV1, PPV2, and PPV3)
introduces the residents to the typical MQSA-mandated annual feedback
for radiologists who interpret mammograms. This audit will have higher educational impact when
the data are presented in the context of the resident’s own work, and will prepare them to reach
national performance benchmarks. Second, the users will be provided feedback on their
assessment of lesion types (masses, calcifications, asymmetries, architectural distortion). This
will objectively highlight areas that may need additional review and emphasis in the educational
program for users and clinical educators. Third, since breast density is a recently introduced
national concern and has implications for clinical practice (from the notification of patients about
their tissue density to the offering of additional clinical tests for patients with high tissue
density), it will be important to assess the user’s understanding of breast density. Lastly, to
keep the users motivated and provide an element of competitive playfulness in a residency
program, a gamification component has been added to the assessment. This will lead to more
engagement by residents with the intent that they will be better prepared for independent
interpretation at the completion of the residency. This gamification component will provide
overall scores for assessments that will be displayed on a leader board (user self-assigned names)
format for comparison to peers.
Results
Within the context of the assessments developed in the simulation (modified medical audit,
lesion type assessment, breast tissue density, and gamification) the results will be correlated with
resident level (first, second, third rotations) targets, and how residents’ in-training medical audits
compare to the national benchmarks trajectory. The residents will be serially debriefed with
formative and summative results, and as data are collected, peer-level comparisons will be
provided.
Conclusions Thoughtful development of objective assessment measures in a simulation design with feedback
is important, in order to provide users with a meaningful simulation experience.
-
Optimality of tool selection in radiologists and naïve subjects
Lisa M. Heisterberg, BS & Andrew B. Leber, PhD
The Ohio State University, Department of Psychology & Medical Scientist Training Program
Rationale While multiple avenues of research have investigated errors in radiology, one possible cause of errors that has received limited study is the way in which radiologists interact with Picture Archiving and Communication System (PACS) software. PACS software contains critical elements that allow radiologists to view and manipulate medical images, but its many tools and features can put individuals at risk of making sub-optimal choices. The difficulty of recruiting radiologist subjects and the high complexity of PACS software can make researching this topic difficult. Hence, we have developed a simple laboratory-based visual search task that approximates windowing, a PACS feature that allows for contrast enhancement of images. We sought to determine if non-expert performance in our task could inform us about radiologist performance, and if subjects would approach our task optimally.
Methods 26 radiologists and 26 subjects naïve to radiological image interpretation completed our study. Each trial tasked subjects with deciding if a letter T was present or absent in displays containing distractor Ls. One of three classes of displays was shown on each trial. For the 80% of trials in which a T was present, the T was not always immediately visible. Each display was initially shown with a default contrast adjustment setting applied that revealed the target on 5% of trials. Subjects could select from 3 additional adjustment settings: an optimal setting that revealed the target on 75% of trials, and 2 other settings that each displayed the target on 10% of trials. Subjects were pre-informed which adjustment setting was optimal for each display class. Selecting the optimal setting first, then selecting additional settings if the target was not found, would allow for the most accurate and efficient search.
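The efficiency argument in the Methods can be made concrete with a short worked calculation. This is only a sketch under two assumptions that the abstract does not state explicitly: that on a target-present trial exactly one setting reveals the target (the reported percentages, 5 + 75 + 10 + 10, sum to 100), and that a subject keeps selecting settings in a fixed order until the target appears. The function name is hypothetical.

```python
def expected_selections(reveal_probs):
    """Expected number of additional settings selected before the target is
    revealed, given the reveal probability of each setting in the order tried.

    Trials where the default setting already shows the target contribute
    zero selections, so the default probability does not appear in the sum.
    """
    return sum(k * p for k, p in enumerate(reveal_probs, start=1))


# Orders of the three additional settings on target-present trials.
optimal_first = expected_selections([0.75, 0.10, 0.10])
optimal_last = expected_selections([0.10, 0.10, 0.75])
```

Under these assumptions, choosing the optimal setting first requires about 1.25 additional selections on average, compared with about 2.55 when it is chosen last.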
Results Accuracy for reporting the absence or presence of a target was not significantly different for radiologists (85.3%) and naïve subjects (83.5%). Naïve subjects spent significantly less time per trial (13.0 s) than radiologists (16.5 s). The percentage of trials on which the optimal adjustment setting was selected first was not significantly different between radiologists (79.6%) and naïve subjects (86.7%). For both radiologists and naïve subjects, those who more often selected the optimal setting first had significantly faster average trial completion times, with no differences in accuracy. Lastly, on target-present trials where the target was not visible with the optimal setting, subjects were significantly more likely to decide a target was absent, indicating that if a target was not visible using the optimal setting, subjects often neglected to select other settings or continue their search.
Conclusions These results demonstrate that in our simplified search task, radiologists are not completely accurate, are more efficient when making optimal choices, and can display sub-optimal behaviors, all of which are similar to naïve subjects. Such results reveal that the performance of non-radiologist subjects can inform us about radiologist performance in our task. Future studies will correlate radiologist performance in this simplified task with their performance using professional PACS. Overall, we hope to understand how radiologists interact with PACS software, why they may act sub-optimally, and how sub-optimal behaviors can be reduced.
-
Reducing Errors in Pathology Image-
based Decisions through Maximum
Confidence Slating
Jennifer S. Trueblood1, PhD, William R. Holmes2, PhD, Adam C. Seegmiller3, MD, PhD,
Charles Stratton3, MD, Quentin Eichbaum3, MD, PhD
1Department of Psychology, Vanderbilt University
2Department of Physics and Astronomy, Vanderbilt University
3Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center
Rationale Second opinions can significantly improve diagnostic accuracy. However, multiple readings by
different individuals are not always feasible due to shortages in pathology and laboratory medicine
workforce, particularly in low resource settings. We examine whether it is possible to reduce errors
by having the same person perform multiple readings. Research in decision-making has shown a
“wisdom of the crowd within” effect, in which aggregating responses from a single
individual improves accuracy. We apply a similar strategy to decisions about pathology images.
Methods In two experiments, participants (novices in Exp 1 and experts in Exp 2) viewed images of white
blood cells and decided whether each image contained a blast cell (pathological white blood cell) or not. On each
trial, participants were asked to make a binary choice followed by a confidence rating, indicating
their confidence in their decision. Participants viewed each image twice.
Results Results showed confidence was greater for correct as compared to incorrect responses (Exp 1: F(1,
36) = 106.83, p < .001 and Exp 2: F(1, 21) = 33.77, p < .001). We then applied a maximum
confidence slating algorithm (MAX, Koriat, 2012) to each individual’s decisions. For each image,
MAX selects the trial with the higher confidence. We compared this approach with average
performance (AP) as well as a minimum confidence slating (MIN) algorithm that selects the lower
confidence response for each image. We found a main effect of algorithm (Exp 1: F(2, 72) = 29.41,
p < .001 and Exp 2: F(2, 42) = 12.21, p < .001), with post hoc tests showing that MIN generated
lower accuracy than AP, and AP generated lower accuracy than MAX.
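The three aggregation rules compared above can be sketched as follows. This is an illustrative reading of the description in the abstract, not the authors' code: the data layout, the function names, the toy data, and the tie-breaking rule (the first reading wins when confidences are equal) are all assumptions.

```python
def slate(readings, pick):
    """Return the choice from the reading selected by `pick` (max or min),
    comparing readings by confidence. Ties go to the earlier reading."""
    return pick(readings, key=lambda r: r[1])[0]


def accuracies(trials, truth):
    """Accuracy of MAX, MIN, and average performance (AP) for one participant.

    trials: {image_id: [(choice, confidence), (choice, confidence)]}
    truth:  {image_id: correct_choice}
    """
    n = len(trials)
    max_acc = sum(slate(r, max) == truth[i] for i, r in trials.items()) / n
    min_acc = sum(slate(r, min) == truth[i] for i, r in trials.items()) / n
    # AP: proportion correct over all individual readings.
    all_reads = [(c == truth[i]) for i, r in trials.items() for c, _ in r]
    ap = sum(all_reads) / len(all_reads)
    return {"MAX": max_acc, "AP": ap, "MIN": min_acc}


# Toy data for illustration only; not results from the experiments.
trials = {
    "img1": [("blast", 90), ("non-blast", 40)],
    "img2": [("non-blast", 55), ("blast", 70)],
    "img3": [("non-blast", 80), ("non-blast", 60)],
    "img4": [("blast", 30), ("non-blast", 85)],
}
truth = {"img1": "blast", "img2": "non-blast",
         "img3": "non-blast", "img4": "non-blast"}
result = accuracies(trials, truth)
```

In this toy example MAX scores 0.75, AP 0.625, and MIN 0.5, the same ordering as reported in the Results, although the numbers themselves are made up.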
Conclusions In sum, our results show that confidence is associated with accuracy in pathology decisions and
suggest that it can be used as a way to aggregate multiple readings within the same individual.
-
A novel learning-based paradigm to
investigate the visual-cognitive bases of
lung nodule detection
Frank Tong1,2, Ph.D., Malerie G. McDowell1, B.A., William R. Winter3, M.D., M.S.,
and Edwin F. Donnelly3, M.D., Ph.D.
1 Psychology Department, Vanderbilt University
2 Vanderbilt Vision Research Center, Vanderbilt University
3 Department of Radiology, Vanderbilt University Medical School
Rationale: Even expert radiologists will sometimes fail to detect the presence of a pulmonary
nodule in a chest X-ray image, with estimated rates of missed detection of 20-30%.
The challenging nature of this diagnostic task lies not only in the visual contrast or
the size of the nodule, but also in the heterogeneity of nodule appearance and the
variability of the local anatomical background. The goal of our study was to
develop a learning-based paradigm, using image processing software to generate a
large, heterogeneous set of visually realistic simulated nodules, to gain insight into
the visual and cognitive bases of lung nodule detection.
Methods: The current version of our software allows for the creation of simulated nodules
with heterogeneous appearance, allowing for rigorous control over the size, shape,
brightness, contrast, and placement of nodules in 2D chest radiographs.
Results: At the MIP Lab at RSNA, we tested radiologist participants (n=10) with both real
and computer-simulated nodules on a challenging nodule localization task.
Performance accuracy was significantly better for real nodules than for the subtle
simulated nodules we created (70.5% vs. 59.0% accuracy, p < 0.005). Of greater
interest, radiologists performed no greater than chance level at discriminating
whether nodules were real or simulated (mean accuracy 52.9%). Next, we
evaluated the impact of training naive undergraduate participants at a localization
task involving simulated nodules. Participants underwent 3-4 training sessions and
viewed a total of 600 simulated cases. We observed significant improvements
following training for both simulated nodules (30.3% accuracy pre-test, 78.2%
accuracy post-test, p < 0.00001) and real nodules (37.5% pre-test, 62.5% accuracy
post-test, p < 0.0005). In our next experiment, we investigated whether extended
training with either light or dark polarity nodules would lead to polarity-specific
training benefits in initially naive undergraduates. This indeed proved to be the
case, implying that this training regimen led to the learning of a polarity-specific
perceptual template of nodule appearance. Finally, we conducted an exploratory
pilot study with 6 radiology residents to see whether they might show performance
improvements following training with our nodule localization task. The results of
this initial pilot revealed a highly significant improvement in performance with
simulated nodules on the final test day, and a non-significant trend of improvement
for real nodule cases.
Conclusions: Taken together, our results demonstrate that marked improvements in nodule
detection can be achieved by implementing a training regimen with numerous
realistic examples, and moreover, that trained undergraduates can serve as useful
model observers for investigating the visual-cognitive bases of nodule detection.
With continued refinement of our simulation methods and training set of images,
we anticipate that it should be possible to further boost generalization of these
training benefits to real nodule test cases. Future developments of this nodule
localization training paradigm could prove useful as a software tool for enhancing
the diagnostic training of radiology residents.
-
The Importance of Peripheral Visual
Processing and Eye Movements in Search
with 3D Images
Miguel P. Eckstein, Miguel A. Lago, Craig K. Abbey
Department of Psychological & Brain Sciences, UC Santa Barbara, Santa Barbara, CA. 93106, USA
RATIONALE When radiologists use a 3D imaging modality to diagnose a disease they often read the data as a
stack of 2D slices and scroll through the slices. The foveated nature of the human visual system
and the typical reading times prevent radiologists from exhaustively exploring all regions of the
image set with their high-resolution fovea. Thus, radiologists must rely on vision away from the
fovea (the visual periphery) to process many regions of the images. Here, we investigate how
target detectability varies with retinal eccentricity and explore eye movement patterns and
detection accuracy during 3D search.
METHODS We measured the detectability of small and large targets briefly presented at a
known location (50% probability) in filtered noise and digital breast tomosynthesis phantoms.
Eye position monitoring allowed us to ensure that observers maintained gaze on a fixation point.
In a separate study, observers searched for the large and small targets (50% probability of target
presence) in 3D volumetric images with the two backgrounds. Observers were given unlimited
time to scroll and search. We measured search accuracy (true positive rate, false positive rate),
eye movements and scrolls.
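For readers unfamiliar with how a true positive rate and a false positive rate relate to detectability, the sketch below computes the standard equal-variance Gaussian sensitivity index, d' = z(TPR) - z(FPR). The abstract does not say which detectability measure was used, so this choice is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist


def d_prime(true_positive_rate: float, false_positive_rate: float) -> float:
    """Equal-variance Gaussian sensitivity index for a yes/no task.

    Both rates must lie strictly between 0 and 1; NormalDist.inv_cdf raises
    StatisticsError otherwise, so rates of exactly 0 or 1 need a correction
    before calling this function.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(true_positive_rate) - z(false_positive_rate)


# Hypothetical rates for illustration only.
easy = d_prime(0.90, 0.10)
hard = d_prime(0.70, 0.30)
```

A larger d' means the observer separates target-present from target-absent trials more reliably; equal true and false positive rates give d' = 0.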
RESULTS The results show strong dissociations in detectability in the visual periphery across large and
small targets for both synthetic textures and DBT phantoms. Detectability for the small target
degraded abruptly in the visual periphery while that of the larger target reduced more
moderately. For the 3D search, participants were unable to explore significant portions of the
data with fixational eye movements, suggesting that they relied on peripheral processing for their
decisions. We found that 3D search led to a significant reduction in target detectability of the
small targets. We found large variability in human performance detecting the small targets in 3D
search. Individual detectabilities for the small signals in 3D search were related to the observers’
eye movements: observers’ search accuracies were inversely correlated with the average closest
distance of the observers’ fovea to the signal. Detectability did not correlate with search times.
CONCLUSION For 3D imaging modalities, the properties of the human visual periphery and eye movements are
critical in determining the detectability of searched targets and might play an important role in
determining individual variability in search accuracy.
-
Foveated Model Observers applied to DBT
image phantoms
Miguel A. Lago1, Bruno B. Barufaldi2, Predrag R. Bakic2, Craig K. Abbey1, Susan P. Weinstein2, Brian
Englander2, Andrew D. Maidment2, Miguel P. Eckstein1
1Department of Psychological & Brain Sciences, UC Santa Barbara, Santa Barbara, CA, USA
2Department of Radiology, University of Pennsylvania, Philadelphia, PA, USA
RATIONALE Digital Breast Tomosynthesis (DBT) is becoming the standard for breast imaging. New 3D imaging
modalities bring large volumes of data that cannot be exhaustively explored with eye movement fixations.
Thus, radiologists read the images with visual processing away from the fovea (visual periphery).
However, accuracy in the visual periphery is degraded relative to foveal processing. Current model
observers (Channelized Hotelling and Non-Prewhitening Matched Filter with an Eye Filter) only model
visual processing at the fovea and might not be sufficient to account for human performance with 3D
imaging modalities. Here, we propose a new Foveated Channelized Hotelling Observer (FCHO) that
incorporates vision across the entire visual field with reduced spatial detail away from the fovea. We
compare the FCHO model to traditional non-foveated model observers in their ability to predict human
performance when searching for simulated microcalcifications and masses in breast phantoms.
METHODS We designed an experiment consisting of a free search of a simulated signal (microcalcification or mass)
within the UPENN DBT phantom. We ran 12 radiologists in 28 trials of a 3D DBT (64 slices) search and
28 trials of a single slice 2D DBT search with 50% signal presence. Additionally, we trained our FCHO
to search for the simulated signals within the breast phantoms. Our model analyzes the visual field with
different templates at different distances from the fixation point. It also includes an eye movement model
that selects the next fixation point and a scrolling model that goes through the slices of the volumetric
image. We compared human and model accuracy for a single (central) slice of DBT and the complete 3D
DBT.
RESULTS Human observer performance shows significantly lower detectability for the microcalcification in 3D
search (d' = 1.36 ± 0.33) than in 2D search (d' = 3.78 ± 0.54), while masses do not show a significant
difference (2D: d' = 1.89 ± 0.38; 3D: d' = 1.65 ± 0.40). The FCHO model's performance agrees
with these results and captures the interaction in detectability between the signal type
(microcalcification vs. mass) and the search task (2D vs. 3D) that we have also seen in previous results
for our FCHO model with images containing correlated Gaussian noise.
CONCLUSION We presented the first application of the FCHO to more realistic images containing structures, different
tissues, and backgrounds that are more complex. The FCHO model correctly predicted the dissociation
in results across signal types while traditional model observers did not. The results motivate the use of
foveation in model observers in order to assess image quality in 3D DBTs.
-
The Role of Comparison in Categorization
Learning of Chest X-ray: An Eye
Movement Study
Yanju Ren, Ph.D.1; Yuanjie Zheng, Ph.D.2
1School of Psychology, 2School of Information Science and Engineering, Shandong Normal
University, Jinan, P. R. China
Rationale: The chest X-ray is one of the most commonly accessible radiological examinations for
screening and diagnosis of many lung diseases. How a novice observer learns to classify a
chest X-ray as normal or abnormal (and, furthermore, as benign or malignant) is an important
research topic in the domain of medical education. Comparison learning is one of the key
processes by which people learn and has been broadly found to be very effective in the context
of, for example, category learning. The present study therefore explores the role of comparison
in chest X-ray categorization using eye tracking technology.
Methods: To this end, two eye movement experiments were conducted. Experiment 1 consisted of three
phases. In the pretest phase, forty-eight undergraduate students performed the chest X-ray
categorization (normal or abnormal) task to establish baseline performance. In the second
phase, half of the participants (24 of 48) were assigned to the comparative learning condition
and the other half were assigned to the non-comparative learning condition. In the
test phase, the participants performed the categorization task. Experiment 2 contained
three similar phases, but only the comparison task was employed and the presentation time of the chest
X-ray was manipulated.
Results: In Experiment 1, the participants categorized the chest X-rays at chance level in the pretest
phase. After comparison learning, the comparative learning group showed a marked
improvement on the chest X-ray categorization task in the test phase, reflected in shorter
reaction times, fewer saccades, shorter fixation durations, and longer saccade amplitudes,
among other measures. Participants in the long presentation time group achieved better categorization
performance than those in the short presentation time group.
Conclusions: The two eye tracking experiments demonstrate that learning style and presentation time of
chest X-rays play important roles in medical image categorization.
-
Investigating Observer Gaze Patterns on Facial
Disfigurements from Head and Neck Cancer
Krista M. Nicklaus, MSE1,2, Enrique Callado3, Joowon Cho, MSE3, Jun Liu, PhD2, Mary Catherine Bordes,
BS2, Gregory P. Reece, MD2, Summer E. Hanson, MD, PhD2, Jeffery M. Engelmann, PhD4,
Mia K. Markey, PhD1,5
1Biomedical Engineering, The University of Texas at Austin, 2Plastic Surgery, The University of Texas MD
Anderson Cancer Center, 3Electrical Engineering, The University of Texas at Austin, 4Psychiatry and
Behavioral Medicine, Medical College of Wisconsin, 5Imaging Physics, The University of Texas MD Anderson
Cancer Center
Rationale
Facial disfigurement resulting from head and neck cancer can have devastating effects on psychosocial
functioning. Body image changes experienced by head and neck cancer patients contribute to high levels of
depression and anxiety, social isolation, impaired quality of life, and sexual difficulties. Many head and neck
cancer patients feel discounted or stigmatized, are preoccupied by appearance changes, or avoid social
situations due to changes in appearance and functioning. Feeling stigmatized can arise from concerns about how
others will react to one’s appearance as well as how others actually react. Individuals with facial disfigurement
are highly aware of how others behave toward them (e.g., staring, gaze aversion, unwelcome comments) and of
facial expressions perceived to convey negative emotional reactions (i.e., disgust). Our long-term goal is to
support head and neck cancer patients in developing realistic expectations about how other people in non-social
group settings (e.g., customers in a grocery store) will respond to their appearance by presenting selected
information from a normative database of the responses of lay observers to facial disfigurement resulting from
head and neck cancer. While several methodologies could be used to study observers’ cognitive, behavioral, and
emotional responses to facial disfigurement, eye tracking has the potential to help us understand behavioral
responses such as staring and gaze aversion. This preliminary study examines lay observers’ gaze patterns when
looking at clinical photographs of head and neck cancer patients.
Methods
Eye movements were recorded and tracked with a Tobii TX300 Eye tracker (Tobii Technology Inc., Falls
Church, VA), with a sampling rate of 300 Hz. 20 lay observers viewed 144 face images for 6 seconds each (4
images were used for practice). The images are from 35 head and neck cancer patients with varying degrees of
disfigurement over multiple time points from 1 to 12 months post face reconstruction. Two clinical experts
determined whether the facial disfigurement was on the left or right side of the face, or undetermined. The
midline of the face was defined by the line from the central hairline, through the pronasale, to the lowest point
of the chin. Gaze fixations and saccades were mapped using the EyeMMV toolbox in MATLAB (MathWorks,
Natick, MA, USA). Fixation location, dwell time, and saccades were investigated in relation to the location of
disfigurement.
Results
The locations of fixations, duration of fixations, and saccades were mapped to each image. There was
substantial variation in the gaze patterns across observers and stimuli.
Conclusions
Eye tracking data has the potential to identify features of disfigured faces that attract attention of lay observers.
However, substantial inter-observer variability suggests that future work is needed to investigate factors that
influence lay people’s cognitive, behavioral, and emotional responses to facial disfigurement. Factors to
consider include quantitative measures of facial disfigurement; the lay person’s body image and affective state;
and the lay person’s demographic variables.
-
Influence of radiology expertise on the perception of nonmedical images
Brendan Kelly, Louise A. Rainford, Mark F. McEntee, Eoin C. Kavanagh
Abstract We investigated whether participants with differing diagnostic accuracy and visual search behavior during radiologic tasks also differ in nonradiologic tasks. Four clinician groups with different radiologic experience were used: a reference expert group of five consultant radiologists, four radiology registrars, five senior house officers, and six interns. Each of the four clinician groups is known to have significantly different performance in the identification of pneumothoraces in chest x-ray. Each of the 20 participants was shown 6 nonradiologic images (3 maps and 3 sets of geometric shapes) and was asked to perform search tasks. Eye movements were recorded with a Tobii TX300 (Tobii Technology, Stockholm, Sweden) eye tracker. Four eye-tracking metrics were analyzed. Variables were compared to identify any differences among the groups. All data were compared by using nonparametric methods of analysis. The average number of targets identified in the maps did not change among groups [mean = 5.8 of 6 targets (range 5.6 to 6, p = 0.861)]. None of the four eye-tracking metrics investigated varied with experience in either search task (p > 0.5). Despite clear differences in radiologic experience, these clinician groups showed no difference in nonradiologic search pattern behavior or skill across complex images. This is another viewpoint adding to the evidence that radiologic image interpretation is a learned skill and is task specific.
-
Variations in Lung Nodule Detection and Functional Visual Field of Radiologists
Geoffrey D. Rubin, MD, MBA, Brian Harrawood, Kingshuk RoyChoudhury, PhD,
Justus E. Roos, MD, Martin Tall, Sandy Napel, PhD
Departments of Radiology, Duke University and Stanford University
Rationale The foveal gaze of radiologists is exposed on average to only 27% of the lung volume, yet 76% of imbedded lung nodules are included in that volume. This suggests that radiologists’ functional visual field (FVF) for lung nodule detection in CT scans extends well beyond the limits of central gaze (within a 5° gaze angle). To better characterize radiologists’ FVF while dynamically scrolling through CT scans, we measured the distance between radiologists’ gaze point and a lung nodule immediately prior to its formal detection at the “moment of recognition”.
Methods Time-varying gaze traces acquired from 13 radiologists using unconstrained stacked transverse section paging and eye tracking during the interpretation of 40 chest CT scans enriched with 157 simulated 5-mm solid lung nodules were subdivided into periods of nodule visibility (exposures). “Gaze distances” were measured between gaze points and lung nodules to quantify their relationship with nodule exposure duration and detection. The moment of recognition (MoR) was defined as the time point immediately preceding the saccade that converged upon and resulted in the immediate detection of a nodule. MoR distances were measured and characterized as central (foveal) versus peripheral vision based upon a 5° gaze angle threshold.
Results There were 9,751 nodule exposures, defined as discrete periods of nodule visibility exclusive of those following a detection, that consumed 6% of the total search time. 3,371 of these exposures resulted in the detection of 997 TP nodules (49% detection rate). The duration of exposure to undetected (false negative) nodules was 3.5 times longer than TP nodules (p
-
Figure: (A) Free longitudinal (z) search path from a single reader examining a chest CT scan with three 5-mm lung nodules centered on the orange lines and visible over the faded orange bands. Red regions indicate periods when nodules were displayed on visualized cross-sections but were not detected. The green region indicates the period when one of the three nodules was detected by the reader. The three nodules were visible for 3.1, 3.3, and 3.4% of the search time and were exposed to central gaze for 0.2, 0.0, and 0.1%, respectively, across the 357 second search duration. (B) The 3.5 second region contained within the green zone in (A) is magnified and displayed with the corresponding gaze point samples. Vertical gray zones indicate regions where the target lung nodule is not visible at the extremes of the slab through which the subject scrolls. Selected time points (1-6) are illustrated with corresponding CT section, gaze point (red circle with 50-pixel diameter), target (orange circle), and acceptance of the detection (green circle). The subject is positioned such that central gaze (5° gaze angle) is within 90 pixels of the gaze point. At the beginning of the trace, the nodule is not visible, but the subject scrolls down and the nodule is revealed when the gaze is 353 pixels away (1). The gaze then deviates closer to the x, y position of the nodule (2), but moves back to the posterior lung (3). Following a saccade, the gaze shifts anteriorly to within 164 pixels of the nodule (4). Another saccade ensues bringing the gaze within 50 pixels of the nodule, just as the viewer scrolls beyond the nodule, reverses scroll direction and lands on the nodule (5). After 1 second scrutinizing the nodule, it is accepted (6). 
Based upon the location of the final saccade converging on the nodule, the moment of recognition is classified to occur at the dotted black line and the preceding time period is considered to be search while the subsequent time period is considered to be decision making.
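The central versus peripheral classification used above can be sketched from the one calibration the caption gives: a 5° gaze angle corresponds to 90 pixels. The sketch assumes a flat display viewed head-on, so that on-screen distance and gaze angle are related by a tangent; the abstract does not describe the authors' actual conversion, and the function names are hypothetical.

```python
import math

CENTRAL_ANGLE_DEG = 5.0   # central (foveal) gaze threshold from the abstract
CENTRAL_RADIUS_PX = 90.0  # on-screen distance that corresponds to 5 degrees

# Viewing distance expressed in pixels, implied by the calibration above
# under the flat-screen assumption.
VIEW_DIST_PX = CENTRAL_RADIUS_PX / math.tan(math.radians(CENTRAL_ANGLE_DEG))


def gaze_angle_deg(distance_px: float) -> float:
    """Gaze angle subtended by an on-screen distance from the gaze point."""
    return math.degrees(math.atan(distance_px / VIEW_DIST_PX))


def classify(distance_px: float) -> str:
    """Label a gaze-to-nodule distance as central or peripheral vision.
    Treating the boundary itself as central is an assumption."""
    return "central" if distance_px <= CENTRAL_RADIUS_PX else "peripheral"


# Gaze-to-nodule distances quoted in the figure caption (time points 1, 4, 5).
labels = [classify(d) for d in (353, 164, 50)]
```

With this geometry, the 353-pixel distance at time point 1 corresponds to a gaze angle of roughly 19°, well outside central gaze, while the 50-pixel distance at time point 5 falls inside it.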
-
Visual search behavior reveals differences
in diagnostic accuracy based on
experience
Joe Thomas, BSc1, Bradley Fawver, PhD1, Megan Mills, MD2, William Auffermann, MD, PhD2,
Trafton Drew, PhD3, and A. Mark Williams, PhD1
1 University of Utah; Department of Health, Kinesiology & Recreation
2 University of Utah; Department of Radiology and Imaging Sciences
3 University of Utah; Department of Psychology
Rationale
A substantial number of medical errors in radiology are attributed to failures of perception or
failures of decision making. Although it is believed that experience in diagnostic imaging
naturally leads to the development of expertise, data from other medical fields suggests this may
not be the case. The purpose of this study was to explore how diagnostic accuracy differs across
radiology professionals as a function of experience, as well as ascertain the extent to which
changes in visual search behaviors underlie improved diagnostic outcomes.
Methods
Twenty radiologists (5 Attending, 5 Fellows, 10 Residents) dictated their findings on 10
musculoskeletal cases (negative and abnormal cases included) obtained from a medical database.
Mobile eye-tracking glasses sampled gaze behavior at 120 Hz, while Likert-scale measures of
mental effort and confidence were obtained after each case. Key areas of interest (i.e., where the
abnormality was located) were identified on each abnormal case, and two radiologists coded
accuracy. Simple linear regressions were utilized to explore relationships between experience
(i.e., resident, fellow, attending physician), diagnostic outcomes (e.g., trial time, accuracy), and
attentional processes (e.g., fixation, saccadic behavior).
Results
Participants demonstrated an 89% accuracy rate on negative cases and a 67% accuracy rate on
abnormal cases, so analyses proceeded exclusively on abnormal cases. Attending physicians
exhibited only marginally improved diagnostic accuracy on abnormal cases (67%) compared to
individuals in the resident program (61%). Level of experience was associated with reduced trial
time (p < .001) and increased confidence in the diagnosis (p = .004). More experienced
individuals demonstrated fewer fixations (p =.001) of shorter duration (p =.003) on the dictation
screen, fewer fixations on the medical images (p < .001), and fewer fixations on key areas of
interest (p = .002). Experience was also associated with increased saccadic amplitude (p = .007)
and decreased peak saccadic velocity (p < .001). After controlling for experience, the total
number (p = .001), duration (p = .015), and percentage of fixations (p = .004) on key areas of
interest were associated with improved diagnostic accuracy.
Conclusion
As expected, experienced radiologists spent less time diagnosing each case and were more
confident in their diagnosis. Experience was also associated with more purposeful visual search
behavior on the images and more efficient use of medical imaging technology. However, while
time spent viewing information-rich areas of the medical images (i.e., the abnormality) was
positively associated with diagnostic accuracy, it was negatively associated with experience.
Findings suggest a physician’s confidence in their diagnosis might be misplaced when cases are
dictated too quickly or when individuals spend insufficient time extracting relevant information
from key areas of the visual display.
-
Impact of expertise on reading mammograms: An eye-tracking study
Lucie Lévêque1,2 (MSc), Hilde Bosmans3 (PhD), Lesley Cockmartin3 (PhD), Hantao Liu2 (PhD)
1School of Computer Science and Informatics, Cardiff University, United Kingdom
2Department of Computer Science and Software Engineering, Xi’an Jiaotong-Liverpool University, China
3Department of Radiology, University Hospitals KU Leuven, Belgium
Rationale
Breast cancer screening uses low-dose x-rays to detect cancers early, and thus to allow a more efficient treatment. It is critical to understand how medical professionals perceive and interpret mammograms with a view to reducing errors in screening mammography. Various eye-tracking studies have been undertaken in this area, presenting different experimental designs (e.g., films vs. digital mammograms, public databases vs. selected cases). A prominent topic in the literature is the comparison between experienced and less experienced readers.
Methods
An eye-tracking experiment was conducted with several expert radiologists, trainee radiologists, and physicists, who were asked to read 196 medio-lateral oblique (MLO) mammogram views from 98 patients. The cases were free of lesions, but the readers were not informed about this fact. After reading both left and right images of a case, the participants had to answer the following question: “refer or not refer?” by focusing their gaze on one of these options on the screen. The eye movements of the participants were recorded using a non-invasive SMI Red-m eye-tracking system.
Results
Gaze information was extracted from the raw eye-tracking data obtained during the experiment, including the number of fixations per stimulus, their coordinates and duration. An analysis of variance (ANOVA) was used to study the similarity between the three expert radiologists in terms of mean fixation duration. Results show no statistically significant difference between the three expert radiologists (i.e., p
-
Fig. 1: Illustration of the mean fixation duration of expert radiologists R1, R2 and R3 (in red), trainee radiologists T1, T2 and T3 (in green), and physicists P1 and P2 (in blue), averaged over all fixations recorded for all test stimuli. Error bars indicate a 95% confidence interval.
Saliency maps, i.e., topographic representations indicating the conspicuousness of scene locations, were created using the fixations obtained from the eye-tracking experiment. Each fixation location gave rise to a greyscale patch simulating the foveal vision of the human visual system. In a saliency map, salient regions show where the observers focused their gaze most frequently. The maps show that expert and trainee radiologists’ gaze patterns are concentrated, whereas physicists’ gaze patterns are more widely distributed over the mammogram.
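The construction of a fixation-based saliency map can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the isotropic Gaussian patch, its width `sigma_px`, the (x, y) input format, and the function name `saliency_map` are all assumptions introduced here.

```python
import math

def saliency_map(fixations, width, height, sigma_px=2.0):
    """Accumulate one Gaussian patch per fixation and scale the map to [0, 1].

    fixations: iterable of (x, y) pixel coordinates (hypothetical input format).
    sigma_px:  assumed patch width standing in for the foveal extent.
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma_px ** 2))
    # Normalise so the most frequently fixated location has value 1.
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[v / peak for v in row] for row in grid]
    return grid

# Made-up fixations: two close together near (3, 3) and one at (8, 8).
demo = saliency_map([(3, 3), (3, 4), (8, 8)], width=12, height=12)
```

In this toy example the region around the two neighbouring fixations ends up brighter than the isolated one, which is the sense in which concentrated gaze patterns yield compact salient regions.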
Conclusions
An eye-tracking experiment was designed and conducted to study the impact of medical specialty and level of experience on perceptual behaviour while interpreting mammograms. Results showed that physicists have, in general, a higher dwell time than experts, whereas trainees have a lower dwell time. Furthermore, the physicists’ gaze patterns were more dispersed than those of the radiologists, whereas the trainees showed patterns similar to those of the radiologists.
-
The strength of the gist of the abnormal in the
unilateral and bilateral mammograms
Ziba Gandomkar* (a), Ernest U. Ekpo (a), Sarah J. Lewis (a), Karla K. Evans (b), Kriscia Tapia (a), Tong Li (a),
Seyedamir Tavakoli Taba (a), Jeremy M. Wolfe (c), Patrick C. Brennan (a)
(a) Medical Imaging Sciences, Faculty of Health Sciences, University of Sydney, Sydney, NSW, Australia;
BreastScreen Reader Assessment Strategy (BREAST), University of Sydney, Sydney, NSW, Australia.
(b) Department of Psychology, University of York, Heslington, York, UK.
(c) Visual Attention Lab, Harvard Medical School, Cambridge, MA, USA.
Rationale Experts can perceive the gist of the abnormal in the negative prior unilateral mammograms of women who were
subsequently diagnosed with breast cancer. Here, we compared the strength of the gist from unilateral and bilateral mammograms.
Methods Seventeen radiologists viewed 60 cases in two different experiments (GistUnilateral and GistBilateral). In GistUnilateral, 60
unilateral craniocaudal mammograms were presented in a randomly generated sequence for a half-second to the
radiologists, who were asked to provide an abnormality probability for each case on a scale from 0 (confident normal)
to 100 (confident abnormal). In GistBilateral, we presented bilateral mammograms of the same cases using a similar
experimental protocol. Readers were randomly assigned to two groups: the first did the unilateral experiment first, while
the second did the bilateral experiment first. Four categories of mammograms (15 cases per category) were
included: 1) Cancer cases, which contained biopsy-proven malignancies; 2) Normal cases, which remained normal for
at least the next two years; 3) Prior_Vis cases, which contained retrospectively visible non-actionable cancer signs; 4)
Prior_Invis cases, which did not contain visible cancer signs. Mammograms from the last two groups were from women
who subsequently developed biopsy-proven malignancies. For each radiologist and each category, the Pearson
correlation between the unilateral and bilateral gist responses was calculated. In each experiment, three pair-wise
classifications, i.e. Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal were analysed. A paired, two-sided
Wilcoxon Signed Rank test was used to investigate whether the values of area under receiver operating characteristic
curves (AUC) were at an above-chance (AUC=0.5) level. The same test was also used to show whether the AUC values
from two experiments differed significantly for each pair-wise classification. For each radiologist and each case, we also
calculated the average of the two gist responses recorded in the two experiments and produced GistAVE, i.e.
½(GistUnilateral+GistBilateral).
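The AUC and GistAVE calculations described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `auc` and `gist_ave` and all ratings are made up for the example, the AUC is computed in its pairwise (Mann-Whitney) form, and the Wilcoxon signed-rank comparison across readers is not shown.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    Each (positive, negative) pair scores 1 if the positive case is rated
    higher, 0.5 on a tie, and 0 otherwise; the AUC is the mean pair score.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def gist_ave(unilateral, bilateral):
    """Average the two gist ratings case by case: 0.5 * (unilateral + bilateral)."""
    return [0.5 * (u + b) for u, b in zip(unilateral, bilateral)]

# Hypothetical 0-100 ratings from one reader (made-up numbers, not study data).
cancer_uni, cancer_bi = [80, 60, 40, 70], [60, 70, 65, 50]
normal_uni, normal_bi = [30, 50, 20, 45], [55, 35, 40, 30]

auc_uni = auc(cancer_uni, normal_uni)
auc_bi = auc(cancer_bi, normal_bi)
auc_ave = auc(gist_ave(cancer_uni, cancer_bi), gist_ave(normal_uni, normal_bi))
```

An AUC of 0.5 corresponds to chance, so the per-reader values from a function like `auc` are what would be compared against 0.5 and between experiments.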
Results The average correlation coefficients across the 17 readers for Cancer, Normal, Prior_Vis, Prior_Invis, and all cases were
0.17 (CI=0.03-0.31), 0.26 (CI=0.09-0.43), 0.30 (CI=0.12-0.49), 0.35 (CI=0.21-0.49), and 0.35 (CI=0.25-0.44),
respectively. The order of median AUCs in Cancer/Normal and Prior_Vis/Normal classifications from the highest to the
lowest was GistAVE>GistUnilateral>GistBilateral. All differences except the difference for GistAVE and GistUnilateral in
Prior_Vis/Normal classification were significant. In Prior_Invis/Normal classification, the order was
GistAVE>GistBilateral>GistUnilateral. None of the differences in the AUC values for Prior_Invis/Normal classification were
significant. On average, the AUCs of Cancer/Normal, Prior_Vis/Normal, Prior_Invis/Normal classifications based on
GistUnilateral respectively dropped by 8%±6%, 10%±8%, and 1%±8% in the bilateral experiment while these AUCs
increased by 5%±3%, 2%±4%, and 4%±6% after averaging two signals. On average, the AUCs of Cancer/Normal,
Prior_Vis/Normal, Prior_Invis/Normal classifications based on GistAVE were 82%±4%, 74%±3%, and 67%±5%.
Conclusions There is a weak association between the gist signals from unilateral and bilateral mammograms. The signal was stronger in
the unilateral experiment. When the two signals were averaged, the AUCs increased. The improvement could result
from cancelling out random noise by averaging two values. Further investigation of intra-reader variability, and of
the AUC when a reader's unilateral gist responses are averaged over multiple experiments, is required.
-
Characterizing Image Features That Allow for Rapid Breast Cancer Detection Even Before Appearance of Visibly Actionable Lesions
Karla K. Evans1 & Jeremy M. Wolfe2,3
1Department of Psychology, University of York 2Department of Surgery, Brigham & Women's Hospital
3Department of Ophthalmology, Harvard Medical School
Rationale Expert radiologists can detect a “global gist signal” in mammograms allowing them to distinguish normal from abnormal cases at above chance levels even in mammograms acquired before the development of visible, actionable lesions (“priors”). In previous studies with filtered images, we found that the gist signal was strong in the high spatial frequencies, not in the low frequencies. In the present study, we seek to more precisely isolate the spatial frequency information that radiologists are using to make successful gist decisions.
Methods Radiologists were presented with 120 bilateral mammograms. Half were completely normal and remained normal for at least four subsequent years. The other half were abnormal. These were subdivided equally into three types of mammograms: subtle cancers, obvious cancers, and mammograms acquired 3 years prior to the mammograms that showed visibly actionable cancer. Radiologists were asked to rate the abnormality of the images on a 0-100 scale after an exposure of 500 msec. We collected ratings on this set from 21 radiologists at different experience levels. They viewed the full set in three different conditions across three blocks. The conditions were: original mammograms without manipulation, mammograms retaining only spatial frequencies above 0.5 cycles per degree of visual angle (cpd), and mammograms retaining only spatial frequencies above 1 cpd. The order of the blocks was counterbalanced across participants.
Results Using the normal cases to estimate the false-positive rate in all cases, we can calculate d’ for each of the three types of abnormal case and for each filter condition. The results are shown in the table for all 21 observers and, in parentheses, for the 16 observers who read more than 1000 cases per year:
            Original      Freq > 0.5 cpd   Freq > 1.0 cpd
Subtle      .79 (.94)     .61 (.84)        .32 (.29)
Obvious     .85 (1.12)    .88 (1.10)       .62 (.62)
Priors      .05 (.18)     .67 (.88)        .05 (.04)
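For readers unfamiliar with the sensitivity index, d’ is the difference between the z-transformed hit rate and false-alarm rate. The sketch below shows one common way to obtain it from 0-100 ratings. It is not the authors' analysis code: the criterion of 50, the log-linear correction, the function names, and the ratings are assumptions made for illustration.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def rates_from_ratings(abnormal_ratings, normal_ratings, criterion=50):
    """Turn 0-100 ratings into hit / false-alarm rates at an assumed criterion.

    A log-linear correction (add 0.5 to counts, 1 to totals) keeps the rates
    away from 0 and 1, where the z-transform is undefined.
    """
    hits = sum(r > criterion for r in abnormal_ratings)
    fas = sum(r > criterion for r in normal_ratings)
    hit_rate = (hits + 0.5) / (len(abnormal_ratings) + 1)
    fa_rate = (fas + 0.5) / (len(normal_ratings) + 1)
    return hit_rate, fa_rate

# Made-up ratings for one abnormal category and the shared normal cases.
hit_rate, fa_rate = rates_from_ratings([60, 70, 40, 90], [20, 55, 10, 30])
sensitivity = d_prime(hit_rate, fa_rate)
```

Because the same normal cases supply the false-alarm rate for every abnormal category, only the hit rate changes between the rows of the table.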
The most interesting finding is that performance improves for Priors when frequencies below 0.5 cpd are filtered out (F(1,15)=23.01, p
-
The “Gist” in Prostate Volumetric
Imaging
Melissa Treviño, Ph.D.1 and Todd S Horowitz, Ph.D1
Marcin Czarniecki, M.D.2
Ismail B Turkbey, M.D.3 and Peter L Choyke, M. D.3
1Basic Biobehavioral and Psychological Science Branch, National Cancer Institute
2Medstar Georgetown University Hospital 3Molecular Imaging Branch, National Cancer Institute
Rationale
Numerous cognitive psychology studies have demonstrated that we can determine the global
context of complex real-world scenes (“scene gist”) in a brief glimpse. Similarly, radiologists
can identify the “gist” of a radiograph (i.e., abnormal vs. normal) better than chance in breast,
lung, and prostate images presented for half a second. However, this rapid perceptual gist
processing has only been demonstrated in static two-dimensional images. Standard practice in
radiology is moving to three-dimensional (3D) “volumetric” modalities. In volumetric imaging,
such as multiparametric MRI (mpMRI), used in prostate screening, a single case consists of a
series of image slices through the body that are assembled into a virtual stack. Radiologists can
acquire a 3D representation of organ structures by scrolling through stacks. Can radiologists
extract perceptual gist from this more complex imaging modality?
Methods
We tested 14 radiologists with prostate mpMRI experience on 56 cases, each comprising a stack
of 26 T2-weighted prostate mpMRI slices. Lesions (Gleason scores 6-9) were present in 50% of
cases. In practice, lesions are more prevalent and easier to detect in the peripheral zone (PZ) of
the prostate than the transition zone (TZ). For lesion present trials, we used a PZ:TZ ratio of 5:2.
A trial consisted of a single movie of the stack. After each case, participants localized the
cancerous lesion on a prostate sector map, then indicated whether a cancerous lesion was
present, and gave a confidence rating. Presentation duration was varied between groups.
Radiologists were divided into three groups who viewed cases presented at either 48 ms/slice
(20.8 Hz, n = 5), 96 ms/slice (10.4 Hz, n = 5), or 144 ms/slice (6.9 Hz, n = 4).
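The relation between slice duration, frame rate, and total viewing time can be checked with a few lines of arithmetic. The sketch assumes each of the 26 slices is shown exactly once with no blank frames in between, which the abstract does not state; the function names are illustrative.

```python
SLICES_PER_STACK = 26  # from the Methods: 26 T2-weighted slices per case

def frame_rate_hz(ms_per_slice):
    """Presentation rate in slices per second."""
    return 1000.0 / ms_per_slice

def movie_duration_s(ms_per_slice, n_slices=SLICES_PER_STACK):
    """Total movie length, assuming one pass through the stack with no gaps."""
    return ms_per_slice * n_slices / 1000.0

for ms in (48, 96, 144):
    print(f"{ms} ms/slice -> {frame_rate_hz(ms):.1f} Hz, "
          f"{movie_duration_s(ms):.3f} s per stack")
```

Under that assumption, even the slowest group would have seen each case for only a few seconds.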
Results
Radiologists could detect lesions in both zones above chance, with PZ producing higher d’
scores (d’ [95% CI]: PZ = 0.73 [0.46 – 1.00]; TZ = 0.50 [0.08 – 0.91]). Detection performance did
not vary significantly with slice duration, F(2,11) = 0.74, p = 0.50 (d’ [95% CI]: 48 ms = 0.64
[-0.22 – 1.50]; 96 ms = 0.78 [0.31 – 1.24]; 144 ms = 0.38 [0.06 – 0.71]). Localization accuracy
(chance ~= 0.08) was 0.40, 0.47, and 0.48, respectively. While the interaction between slice
durations and zone did not reach significance, F(2,11) = 3.53, p =.07, detection peaked at 48 ms
for PZ and 96 ms for TZ (see Figure 1).
Conclusions
Our data indicate that radiologists do develop gist perception for 3D modalities. As expected,
detecting peripheral lesions was easier than transition lesions. Surprisingly, slower presentation
rates did not improve performance. There may be an optimal framerate for processing 3D
anatomical information, depending on anatomical site and/or lesion conspicuity, but further
research is needed.
Figure 1. d’ for lesions located in the PZ & TZ as a function of slice duration. Error bars display
95% confidence intervals.
[Figure 1 plot: y-axis d' (-1 to 2); x-axis slice duration (48, 96, 144 ms); series PZ and TZ.]
-
Perceptual gist in multiparametric imaging
Todd S Horowitz, Ph.D1
and Melissa Treviño, Ph.D.1
Marcin Czarniecki, M.D.2
Ismail B Turkbey, M.D.3
and Peter L Choyke, M. D.3
1Basic Biobehavioral and Psychological Science Branch, National Cancer Institute
2Medstar Georgetown University Hospital
3Molecular Imaging Branch, National Cancer Institute
Rationale
Humans can extract the “gist” of a visual scene in a fraction of a second, categorizing it as, say, indoor
or outdoor, open or closed. This information can facilitate recognition of objects and guide future eye
movements. Recent studies have demonstrated an analogous ability for radiologists to classify briefly
presented images (e.g., mammograms) as “normal” or “abnormal”. Previously, we have extended that
finding to prostate multiparametric magnetic resonance imaging (mpMRI). MpMRI combines
anatomical information from T2-weighted (T2W) sequences, and functional sequences such as
conventional diffusion-weighted imaging (DWI) and the apparent diffusion coefficient (ADC).
Standard workstation formats present these imaging modalities side-by-side. Our goal was to study the
nature of mpMRI gist in different modalities. Which modality generates the strongest gist? Are
anatomical or functional sequences more useful? Do these modalities provide independent gist
information? Furthermore, we tested the hypothesis that experts performed better because they were
more likely to fixate lesions during the brief exposure.
Methods
Experiment 1: Three groups of five radiologists with prostate mpMRI experience were shown 100
images from a single modality (T2W, DWI, or ADC). The same cases were used across groups. Lesions
(Gleason scores 6-9) were present in 50% of the images. Images were taken from the base, mid, or
apex regions of the prostate. Stimuli were presented for 500 ms, followed by a prostate sector map.
Participants first localized the lesion on the sector map (whether or not they saw a lesion), indicated
whether or not a lesion was present, then provided a confidence rating. In Experiment 2, seven novice
observers with no radiological training and two radiologists with prostate mpMRI experience
performed the same task on a set of 100 T2W images while a Tobii eye tracker recorded their eye
movements.
Results
Experiment 1: All three groups detected lesions better than chance [d' mean (sd): T2W 0.83 (0.51); DWI
0.80 (0.29); ADC 1.16 (0.31)]. Partial correlations between modalities, holding Gleason score
constant, were moderate and significant [r: T2W x DWI .23; T2W x ADC .35; DWI x ADC .37].
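A first-order partial correlation, as used for the modality comparisons above, removes the linear effect of a third variable (here, Gleason score) from the correlation between two others. The sketch below uses the standard formula. The function names and the idea of passing per-case values as plain lists are assumptions, since the abstract does not describe how the per-case scores were formed.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up per-case scores for two modalities and a made-up control variable.
modality_a = [1, 2, 3, 4]
modality_b = [2, 1, 4, 3]
control = [1, -1, -1, 1]
r_partial = partial_corr(modality_a, modality_b, control)
```

When the control variable is uncorrelated with both scores, as in this toy data, the partial correlation equals the ordinary Pearson correlation.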
Experiment 2: One novice was excluded due to poor-quality eye tracking data. Radiologists again
demonstrated above chance lesion detection [d’ mean (sd) 1.30 (0.47)], while novices did not [d’ mean
(sd) 0.01 (0.40)]. As expected, radiologists were more likely to fixate the lesion on target- (i.e., lesion-)
present trials [radiologists: 60%; novices: 26%]. However, this advantage was not driving their superior
lesion detection. For both groups, lesion detection was no more likely when fixating the lesion than
when failing to fixate (chi-sq, p: radiologists 0.92, .33; novices 0.06, .80).
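The chi-square statistics and p values reported above are linked by the chi-square distribution with one degree of freedom, assuming a 2 x 2 table of fixated / not fixated by detected / missed. The sketch below is illustrative only: the uncorrected Pearson statistic in `chi2_2x2` is an assumption about the test variant, and both function names are introduced here.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def p_value_1df(chi2):
    """Upper-tail p value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Reported test statistics from the Results (radiologists, novices).
p_radiologists = p_value_1df(0.92)
p_novices = p_value_1df(0.06)
```

Applied to the reported statistics, this gives roughly 0.34 and 0.81, close to the reported .33 and .80; the small gap is consistent with the published statistics having been rounded.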
Conclusions
These results indicate that the ADC modality generates the strongest gist signal, but both anatomical
and functional sequences can contribute to mpMRI gist. The moderate correlation across modalities
suggests redundancy. Importantly, performance was unaffected by whether or not observers fixated the
lesion. Radiologists make more informed eye movements even in brief glimpses, but eye movements
do not drive gist perception. Future studies should explore how gist perception drives eye movements
across imaging modalities.
-
The Sum or Parts: Exploring Radiologist
Reliance on Peripheral Vision and
Holistic Processing through Gaze-
Contingent Viewing
Grace L. Nicora B.A.¹, Victoria Wilson¹, Dustin Stokes PhD², Jeanine Stefanucci PhD¹, &
Trafton Drew PhD¹
Department of Psychology¹ and Department of Philosophy² at the University of Utah
Rationale
The holistic processing theory of expertise posits that experts can quickly take in global
information from their stimuli and then use that information to guide their search for a target.
This perceptual processing advantage is thought to underlie the superior performance associated
with expertise. In radiology, this theory posits that when presented with a chest x-ray, expert
radiologists process the Gestalt (or whole) of the image within as little as 100 milliseconds. This
initial Gestalt impression is thought to quickly guide attention to regions that deviate from
normal, thereby enabling fast and accurate performance. Importantly, the holistic processing
theory predicts that experts rely on their peripheral vision to quickly extract the Gestalt
impression of a case, and that this ability does not generalize to tasks outside of their expertise.
Methods
We tested this theory with radiologists through the use of a gaze-contingent viewing (GCV)
window. A GCV window allows us to restrict the amount of peripheral information available to
the viewer. In this design, radiologists were only able to see a circular region (5° of visual angle)
where they were actively fixating. All other parts of the image were occluded. In order to see
more of the image, radiologists had to move their eyes to reveal different parts of the image.
Radiologists searched two types of images: one was a chest x-ray and the other a
non-radiographic image. To test the reliance on peripheral vision, radiologists were exposed to a
normal viewing condition and the GCV condition. Holistic processing theory predicts these
experts will be impaired in the presence of the GCV window to a greater extent for the chest
x-ray image compared to the control image.
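The gaze-contingent window can be sketched as a circular mask centred on the current gaze position. This is a minimal illustration, not the study's software. It treats the 5° figure as the window diameter, which the abstract does not specify, and the viewing distance, pixel density, and function names are hypothetical. A real system would recompute the mask on every eye-tracker sample.

```python
import math

def aperture_radius_px(diameter_deg, viewing_distance_cm, px_per_cm):
    """Radius in pixels of a circular window spanning `diameter_deg` of visual angle."""
    half_angle = math.radians(diameter_deg / 2.0)
    return viewing_distance_cm * math.tan(half_angle) * px_per_cm

def apply_window(image, gaze_x, gaze_y, radius_px, fill=0):
    """Return a copy of `image` (a list of rows) with everything outside the
    circle around the current gaze position replaced by `fill`."""
    r2 = radius_px ** 2
    return [
        [pix if (x - gaze_x) ** 2 + (y - gaze_y) ** 2 <= r2 else fill
         for x, pix in enumerate(row)]
        for y, row in enumerate(image)
    ]

# Hypothetical setup: 60 cm viewing distance, 40 pixels per cm on screen.
radius = aperture_radius_px(5.0, viewing_distance_cm=60.0, px_per_cm=40.0)

# Toy 5 x 5 image of ones, gaze at the centre, 1-pixel radius.
toy = [[1] * 5 for _ in range(5)]
windowed = apply_window(toy, gaze_x=2, gaze_y=2, radius_px=1)
```

Everything outside the circle is occluded, so the viewer has to move their eyes to reveal other parts of the image.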
Results
As expected, GCV led to increased viewing time in both tasks. Critically, this cost was larger for
the radiology task than the control task. We also observed an interaction in saccadic amplitude.
-
GCV led to shorter saccades for both conditions, but the effect was much larger when viewing
chest radiographs.
Conclusions
Our results support the holistic processing theory of expertise in radiology. As predicted by the
holistic processing theory, experts were more impaired in their domain of exper