For Peer Review Only
The Power of Collective Intelligence in a Signal Detection Task
Mary L. Cummings* Humans and Autonomy Laboratory
Duke University Durham, NC, USA
Paul W. Quimby Massachusetts Institute of Technology
[email protected] Cambridge, MA, USA
* Corresponding Author
Page 1 of 21
Mary L. Cummings, PhD: Professor Cummings received her Ph.D. in systems engineering from the University of Virginia in 2004. She is currently a Professor in the Duke University Department of Mechanical Engineering and Materials Science, the Duke Institute of Brain Sciences, and the Duke Electrical and Computer Engineering Department. She is the director of the Humans and Autonomy Laboratory and Duke Robotics.
Paul W. Quimby received his Master of Engineering in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2013. His current research interests include remotely piloted and autonomous aircraft, autonomous system management, and industrial human factors.
URL: http://mc.manuscriptcentral.com/ttie
Theoretical Issues in Ergonomics Science
Abstract
A novel spatial display for a classic signal detection problem was tested in an
experiment with thirty naïve volunteers and two experts, measuring the signal
detection properties of a spatially organized display compared with those of a temporally
organized display. The collective performance of committees was also measured.
The spatial display was quantitatively and qualitatively superior, resulting in a
4.6% increase in correctness of classifications and a 30.5% decrease in miss
percentages. Qualitatively, 83% of participants preferred the new display. The
collective receiver operating characteristic (ROC) performance of the best-performing
half of the naive participants was superior to that of most individuals and all
experts. However, an analysis that examined the value added by both the
improved spatial display and the collective intelligence approach
demonstrated that the bulk of the performance gains, by roughly 2:1, can be
attributed to the improved design rather than to collective action.
Relevance to human factors/ergonomics theory: This study compares a traditional temporal
display and a spatial display for a human attempting classification in a classic signal detection
setting. Collective intelligence improved overall performance, but following established best
practices for decision support displays provided bigger performance gains.
Keywords: signal detection, display design, collective intelligence, ground penetrating radar,
receiver operating characteristic
Introduction

With advances in radar technology, signal processing, and artificial intelligence,
human operators are increasingly employed to determine the presence or absence of
various signals, including baggage and personnel screening and optical inspection such as
systems that scan underneath cars for explosive devices. Because of the uncertainty in
such systems and the lack of clear signal thresholds, the detection of objects with
automated technologies has not advanced to the point where human judgment is no longer
needed. Therefore, investigating the ability of human operators to distinguish objects from
noise is important to the advancement of object detection systems. In addition, modeling
human behavior in this domain provides insights into how automation could assist a
human, and vice-versa. Moreover, understanding the currently achievable levels of
performance and the relative merits of both human and automation elements enables
engineers to construct the most functional systems possible.
The best possible signal detection system in these environments not only effectively
differentiates signals from noise but also provides operators with efficient and reliable
means of finding objects of interest. One key aspect of such a system is the human
interface design, i.e., how well the system assists an operator in object detection.
Typically such systems are designed so that a single operator evaluates a solution, e.g., a
single baggage screener looks at a screen to determine whether an illegal object is in an
image. However, recent research suggests that one form of "collective intelligence,"
known as the wisdom of crowds, can be superior to an individual's decision: pooling
independent answers to a question from multiple people can yield a better solution than
that proposed by a few people (Surowiecki 2004). Unfortunately, there is little to no
research that examines the value added by taking such a collective intelligence approach
to a real world problem, which could have significant personnel and scheduling
implications as compared to better designing displays for individuals.
To this end, in order to investigate both the role of improved decision support and the
potential added value of collective intelligence in an image detection task, a test bed was
used that leveraged an experimental vehicle-mounted ground penetrating radar (GPR)
simulation developed by MIT Lincoln Laboratory (Stanley 2016). This system consisted
of an electronically controlled but mechanically actuated stock vehicle, a GPR array, a
computer system, and the operator display (described in the next section). The
vehicle traveled a previously traversed route and continuously compared the incoming
GPR data to previously recorded GPR data. The goal of this system was to allow an
operator to find buried objects via a change detection paradigm. The chief question
operators must answer is whether they think the display depicts an object or a false alarm.
Such technologies are critical for saving lives in the detection of Improvised Explosive
Devices (IEDs), a current real world application of GPRs.
Improving the design
Figure 1 depicts the original GPR operator display. The radar data is captured and
analyzed by various algorithms using the location of each sensor at the time the data is
recorded. In the original interface, this sensor data is displayed to the user via colored
rectangles, organized by time (on the horizontal axis) and radar elements (on the vertical
axis, Figure 1). A fixed number of sensor readings is displayed at all times (Figure 1(iii)),
with older values being removed from the left edge to make room for new readings added
to the right edge (Figure 1(i)), simulating the forward motion of the vehicle.
Figure 1: Original GPR display. (i) Most recent data appears; (ii) region of no data; (iii) fixed number of radar elements; (iv) fixed number of time steps.
Because the GPR data is captured on a polling timer, the graphic
representations of objects are distorted on the display depending on how quickly the
sensor is traversing the ground. Objects are also distorted by the turning of the vehicle,
which decreases the temporal spacing between sensor readings on the inside radius of the
turn compared to the outside of the turn. No measurements are taken if the forward
movement is very small, if the vehicle is stationary, or if the vehicle is moving backward.
Operators of the original fielded GPR system found the display difficult to interpret,
particularly given the distortion and the representation of the front of the vehicle on the
right side of the display, which was constantly updating. To address these design
shortcomings, we proposed a redesign of the user interface (Figure 2).
The old interface (Figure 1) displays the newest data on the right edge of the display
(Figure 1(i)). This is called a "track-right" display and presents the viewer with the
experience of traveling to the right. This presentation is at odds with the fact
that the vehicle is moving forward, corresponding to the top edge of the display.

Figure 2: Proposed redesigned GPR display. (i) Direction of forward motion; (ii) no data; (iii) detection alert; (iv) previous detection alert; (v) reference traversal; (vi) radar array outline; (vii) tire outline; (viii) 1 m grid.

In
keeping with the principle of the moving part (Wickens and Hollands 2000; Roscoe 1968),
the new interface uses a "track-up" display with the newest data at the top (Figure 2(i)).
This approach allows new data to populate along the top edge of the display, indicative of
vehicle forward motion, rather than along the right edge of the image. To further
improve situation awareness, the physical location of the radar array on the front of the
vehicle was added (Figure 2(vi)).
In order to help operators better understand where a signal is located geographically
relative to the sensors, the new display renders the data as points using nearest-neighbor
interpolation to avoid removing key information. This should improve operators'
perception of where a signal is located, and thus their perception-based situation
awareness (Endsley 2000). Areas of no data show only a gray background
(Figure 2(ii)), as opposed to the blue background of Figure 1(ii).
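The point-based rendering described above can be sketched as a nearest-neighbor rasterization; this is an illustrative approximation (the sample format, grid size, and distance cutoff are our assumptions, not the fielded implementation):

```python
import numpy as np

def render_nearest_neighbor(samples, grid_w, grid_h, cell=0.1):
    """Rasterize scattered GPR samples onto a top-down grid.

    samples: list of (x, y, value) triples in metres, ground frame.
    Cells with no nearby sample stay NaN, corresponding to the gray
    'no data' background of the redesigned display.
    """
    grid = np.full((grid_h, grid_w), np.nan)
    if len(samples) == 0:
        return grid
    xs = np.array([s[0] for s in samples])
    ys = np.array([s[1] for s in samples])
    vs = np.array([s[2] for s in samples])
    for row in range(grid_h):
        for col in range(grid_w):
            # centre of this display cell in ground coordinates
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
            d2 = (xs - cx) ** 2 + (ys - cy) ** 2
            i = int(np.argmin(d2))
            # fill only if the nearest sample is within one cell width,
            # so sparse regions remain 'no data' rather than smeared
            if d2[i] <= cell ** 2:
                grid[row, col] = vs[i]
    return grid
```

Rendering each reading at its true ground position, rather than one column per time step, is what removes the speed- and turn-dependent distortion of the original display.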
Detection indicators were added to the new display to help the operator identify which
data to examine, reducing working memory requirements (Nielsen 1993). Large triangles
indicate current detection events (Figure 2(iii)), while small triangles indicate previous
detection events (Figure 2(iv)). These indicators help operators maintain a temporal
understanding of what has previously been analyzed and what is new.
Other improvements were made to reduce the time needed to build situation
awareness and an effective mental model (Staggers and Norcio 1993) by making
data relationships visible. These included adding a black line on each side of the path of the
vehicle's sensors to facilitate understanding of the current versus the reference
traversal (Figure 2(v)). In addition, the radar array and the two front tires (Figure 2(vi &
vii), respectively) were made visible, and a faint grid overlay was added to provide a scale
(Figure 2(viii)).
The design was revamped to reduce the cognitive complexity of the display (Xing
2006). The original colormap for GPR was a full rainbow, chosen because it was thought to
have superior contrast to standard black and white colormaps and is standard in GPR
systems (Hunt, Massie, and Cull 2000). The new interface in Figure 2 uses a simpler
colormap based on three colors: a blue, a light gray, and a red. The blue is the lowest radar
return value shown and the red is the highest radar return value shown. A dark gray is used
to signify a lack of data (Figure 2(ii)). Not only does this colormap reduce the cognitive
complexity associated with color interpretation, but it also fulfills the requirement of
avoiding combinations of colors confused by individuals with deuteranopia and tritanopia
color blindness (Harrower and Brewer 2003).
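The simplified colormap can be sketched as a two-segment linear ramp; the RGB anchor values here are illustrative assumptions, not the exact colors of the redesigned display:

```python
import numpy as np

# Illustrative anchors: blue (lowest return), light gray (middle),
# red (highest return). A separate dark gray marks 'no data'.
ANCHORS = np.array([[0.20, 0.30, 0.80],   # blue
                    [0.85, 0.85, 0.85],   # light gray
                    [0.80, 0.15, 0.10]])  # red

def colorize(values):
    """Map normalized radar returns in [0, 1] to RGB via the 3-color ramp."""
    v = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    seg = (v >= 0.5).astype(int)   # 0: blue -> gray, 1: gray -> red
    t = (v - 0.5 * seg) / 0.5      # fractional position within the segment
    lo, hi = ANCHORS[seg], ANCHORS[seg + 1]
    return lo + t[..., None] * (hi - lo)
```

A two-hue ramp through a neutral midpoint avoids the false perceptual boundaries a full rainbow colormap introduces, while keeping the two endpoints clearly distinguishable.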
The new design of the GPR display was based on established human-centered
design principles such as the principle of the moving part, reduced cognitive
complexity, design for recognition instead of recall, and making critical data relationships
visible. Testing should reveal whether the design was improved, as well as how these
design changes affected performance, both individually and collectively.
Method
Participants
Thirty-four participants from a large technology research and development
company were recruited over email and through word of mouth. Three participants were
considered experts because of prior affiliation with the project: they were key members
in the design of both the system and the operator display. One non-expert participant and
one expert were disqualified due to colorblindness, yielding a pool of thirty experimental
participants and two experts. Of the thirty participants, twenty-one were male and nine
were female. Participants ranged in age from twenty-two to seventy-two years (mean = 41,
SD (standard deviation) = 13). All participants had self-reported corrected vision of 20/25
or better.
Simulation
While the fielded GPR system used a commercial flat screen display mounted on the
dashboard with resistive touchscreen capabilities, for convenience and cost savings, this
experiment was conducted using an Apple iPad® lying flat on a tabletop. The iPad had
the same pixel density (52 pixels/cm) as the fielded display, with similar touchscreen
capabilities. The results from this study are not expected to be directly transferable to
the fielded system; rather, they are meant to examine the relative differences given both
the specific display improvements and the use of a collective approach.
Experiment Design
This study was a within-subjects, repeated-measures design. Forty-eight of the 102
images were noise and 54 contained a signal. Scenario order was counterbalanced and
randomized for every participant and between the two interfaces, and interface order
(Old/New first) was counterbalanced across participants. Participants completed 102
scenarios on one interface, then 102 scenarios on the other.
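The counterbalancing and randomization described above can be sketched as follows; the seeding scheme and alternating assignment are illustrative assumptions, not the study's actual protocol code:

```python
import random

def assign_session(participant_idx, scenario_ids, seed=0):
    """Counterbalance interface order across participants and shuffle
    scenario order independently for each interface."""
    rng = random.Random(seed + participant_idx)
    # alternate which interface comes first, participant by participant
    order = ["old", "new"] if participant_idx % 2 == 0 else ["new", "old"]
    plan = []
    for interface in order:
        scenarios = list(scenario_ids)
        rng.shuffle(scenarios)   # fresh random order for each interface
        plan.append((interface, scenarios))
    return plan
```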
Procedure
Each participant was briefed on the interface and its symbology, and shown ten scenarios
with the correct answers. They then practiced on their own, with these same ten scenarios
presented in a different order. For each scenario presented, participants could press one of
two buttons labeled “Object” or “False Alarm”. After submitting each answer in the
training sequence, participants were shown the correct answer. Figure 3 shows example
representations of Noise and Objects as indicated on the Old vs. New displays. Figure 3b
illustrates the selection button.
At the end of the ten training scenarios, a summary screen displayed the results and the
experimenter talked through the incorrect answers with the participant and answered any
questions. If the participant incorrectly answered three or more scenarios, a remedial
training protocol was given which included an additional six scenarios. Three participants
required the remedial training protocol for one of the two interfaces, and one participant
required the remedial training protocol for both interfaces. Participants were informed that
their performance would only be assessed by the correctness of their answers, and not by
the time they took to complete the exercise. Participants were also informed of a $100 gift
certificate for the best performance.
After training, the test scenarios began. Unlike in the training session, participants
were not allowed to change their answers, did not receive feedback on their
performance, and were not shown the correct answers for each scenario. After each
answer, a slider was displayed and the participant was asked to rate his or her confidence
in the given answer on a seven-point scale ranging from "not at all confident" to "very
confident." Every ten scenarios, a slide was shown that suggested the participant could take
a break if desired.

Figure 3: Example scenarios on the old and new interfaces. Top: an object present on the old interface (left) and on the new interface (right). Bottom: an object not present (noise) on the old interface (left) and on the new interface (right), with the "Object or False Alarm" selection button presented for each image.
Results

Given the within-subjects design of this experiment, paired t-tests were used for
data that met underlying normality assumptions. When nonparametric measures were
required, the Wilcoxon signed-rank test was used to evaluate paired samples and the
Mann-Whitney U test to evaluate independent samples; alpha = .05 throughout.
Participants correctly categorized more GPR scenarios using the new interface,
t(29) = 2.66, p = .013. The new interface had a slightly higher average percentage
correct compared to the old (Mold = 66.8%, SDold = 4.4%; Mnew = 69.9%, SDnew = 4.8%),
a 4.6% relative improvement (Figure 4a).
Participants had significantly fewer misses on the new interface, z = 3.77, p < .001 (Mold
= 19.0%, SDold = 6.0%; Mnew = 13.2%, SDnew = 6.1%) (Figure 4b). A miss occurs when
an object is present but the participant selects the "False Alarm" button.
This was a 30.5% relative improvement over the old interface.
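Note that the 4.6% and 30.5% figures are relative changes with respect to the old interface, not percentage-point differences:

```python
def relative_change(old, new):
    """Relative improvement as reported in the text (not percentage points)."""
    return 100.0 * (new - old) / old

# Correctness: 66.8% -> 69.9% is a ~4.6% relative increase
assert round(relative_change(66.8, 69.9), 1) == 4.6
# Miss rate: 19.0% -> 13.2% is a ~30.5% relative decrease
assert round(relative_change(19.0, 13.2), 1) == -30.5
```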
Figure 4: (a) Correct (left) and (b) miss (right) ratios using the old and new interfaces with 95% confidence intervals
Insufficient evidence was found to suggest that participants were more confident in
their answers on either interface, t(29) = 1.92, p = .065, although the trend favored
the new interface (Mold = 4.62, SDold = .64; Mnew = 4.71, SDnew = .70).
However, this metric does not capture the confidence of participants as a
function of their performance. To this end, a metric was calculated to assess the
appropriateness of participants’ confidence in their answers. This metric was
calculated as follows:
Confidence correctness = Σxici / Σci
xi ≡ 1 if the participant answered scenario i correctly, or -1 otherwise
ci ≡ the participant’s confidence [1 − 7] in their answer for scenario i
The confidence-correctness score yields higher values for confidently correct answers.
This straightforward computation weights each answer's correctness by the participant's
confidence, and is needed because previous research indicates that subjective confidence
ratings often do not align with task difficulty (Nickerson and McGoldrick 1963).
Participants' confidence-correctness scores statistically favored the new interface,
t(29) = 3.75, p = .001 (Mold = .38, SDold = .09; Mnew = .47, SDnew = .11).
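A direct implementation of the confidence-correctness score defined above (variable names are ours, and the sample data is hypothetical):

```python
def confidence_correctness(correct, confidence):
    """Sum of (+/-1 correctness x confidence) over the sum of confidence.

    correct:    booleans, one per scenario (True = answered correctly).
    confidence: matching ratings on the 1-7 scale.
    Returns a value in [-1, 1]; confident wrong answers are penalized
    more heavily than unconfident ones.
    """
    x = [1 if ok else -1 for ok in correct]
    num = sum(xi * ci for xi, ci in zip(x, confidence))
    return num / sum(confidence)
```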
A notable demographic result is that women performed better than men in
many respects, but only on the new interface. For the improved interface, women
answered more scenarios correctly, t(29) = 3.20, p = .003 (Mmale = 68.1% SDmale = 4.4%,
Mfemale = 73.6% SDfemale = 3.7%), missed fewer objects, Mann Whitney U = 44.5, p =
.022 (Mdmale = 17.8%, Mdfemale = 8.3%), had a higher confidence correctness score,
t(29) = 2.40, p = .023 (Mmale = .44 SDmale = .10, Mfemale = .53 SDfemale = .08), and had a
higher average confidence, t(29)=2.65, p = .013 (Mmale = 4.56 SDmale = .66, Mfemale =
5.13 SDfemale = .64). These results run contrary to other findings that report women
tend to undervalue their own performance in similar situations (Ehrlinger and Dunning
2003).
In terms of subjective preference, participants overwhelmingly preferred the new
interface. When asked to assess their preference on a five-value Likert scale between the
interfaces, 83% preferred or strongly preferred the new interface, while 17% expressed
no preference or preferred the old interface.
The collective intelligence comparison
While improving the design of a decision support display through well-established
human-centered techniques was expected to produce better performance for
individuals, it is less well known how such an improvement affects collective
decision making. Moreover, while some research suggests that collective decision
making can be superior to individual responses in such image anomaly search
tasks, very little research has examined the relative value added of decision
support improvements versus the use of collective intelligence.
For example, a study examining the relative performance between individuals and
groups in a skin cancer detection task has shown value in collective analysis (Kurvers
et al. 2015), as did another study looking at the power of group decision making in
mammography analysis (Wolf et al. 2015). While interesting and potentially useful in
these specific domains, these studies did not assess how a poorly designed
interface could contribute to subpar individual performance. This study therefore
compares the improvements in both individual and group decision making over the old
interface, in order to examine the relative value added of decision support design
changes versus leveraging collective intelligence.
To this end, Receiver Operating Characteristic (ROC) curves were
obtained for various "committees" of people who effectively "voted" on the presence
or absence of a target. These votes were taken from the experimental results
described in the previous section and were combined to see how the performances of
various groups compared to individual performances. For ranking purposes, performers
were ordered by the total area under the convex hull of their ROC curve to determine their
membership in committees that ranged in size from one (the best performer) to thirty
(all participants). Intermediate committee sizes included 3, 5, and 15 people.
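One plausible way to combine such votes is a simple majority rule; the paper does not spell out its exact aggregation or tie-breaking, so the sketch below is an assumption for illustration:

```python
def committee_decision(votes):
    """Majority vote over individual 'Object'/'False Alarm' calls.

    votes: booleans, one per committee member (True = 'Object').
    Ties are broken toward 'Object' here -- an assumption, reflecting
    that a miss is costlier than a false alarm in IED detection.
    """
    return 2 * sum(votes) >= len(votes)
```

Sweeping the number of "Object" votes required, from one member up to the full committee, traces out a committee's ROC curve.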
Given that sampling errors decrease with the square root of the number of samples, we
elected to use the largest committees possible to determine the value added of collective
intelligence versus improved display design. This resulted in choosing two committees:
the 15 best performers and the 15 worst. The aggregated best and worst performances of these
participants were compared to individual performances to understand the range of likely
outcomes. Further details about additional formulations of various committees and their
performances, particularly in comparing the best and worst performers, can be found
elsewhere (Quimby 2013).
The ROC curves were computed from each participant’s True Positive Rate (TPR)
and False Positive Rate (FPR) for both interfaces (Figure 5). To determine the top and
bottom committee membership, performers were ordered by the total area under the
convex hull of their ROC curve, which is a technique that creates iso-performance
lines for classifiers, i.e., the participants (Provost and Fawcett 2001). The thirty
participants' ROC curves are represented by the faint blue and red lines in Figure 5, with
the two committee performances for the old and new interfaces shown in the bold and
dashed lines.
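Ranking by total area under the ROC convex hull can be sketched as follows; the monotone-chain hull construction here is an illustrative reimplementation, not the study's analysis code:

```python
def roc_convex_hull_area(points):
    """Area under the upper convex hull of a set of ROC points.

    points: iterable of (FPR, TPR) pairs; (0, 0) and (1, 1) are added
    automatically. A larger area indicates a better-ranked classifier
    (here, a participant).
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # pop the last hull point while it lies on or below the chord
        # from hull[-2] to p (keeps only the concave upper hull)
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) >= (y2 - y1) * (p[0] - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    # trapezoidal area under the hull segments
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(hull, hull[1:]))
```

A chance-level participant hulls to the diagonal (area 0.5); operating points below the hull are dominated and do not affect the ranking.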
As shown in Figure 5, the interface mattered, with the new interface consistently
outperforming the old interface when TPR values were greater than 60%. As expected,
in all cases the better committees using both interfaces outperformed the worse
committees. In general for signal detection tasks, the goal is to have high TPRs but low
FPRs, making the upper left region the most desirable.
While Figure 5 demonstrates that the committee formed from the best people
performed better using the new interface, it does not answer how the committee
performed when compared to the best performer, which is shown in Figure 6.

Figure 5: ROC curves for the best and worst committees of 15 with the old and new interfaces

The
performance of the 15-person best committee (labeled in Figure 6 as 15-Best
(Old/New)) appears to be better than or equivalent to that of the best performer on the new
interface (labeled in Figure 6 as 1-Best (Old/New)), while the best performer under the
old interface was better than the committee between 55% and 80% TPR.
Given the nature of ROC curves, to be able to more effectively compare the
relative performance impacts of both the interface design as well as the value of
collective intelligence, an anchor metric must be selected. As discussed previously,
ideally the ROC curves would be in the upper left corner, and indeed, for very mature
tests, such as a peptide test to diagnose heart failure (Florkowski 2008), it is routine to
have both high TPRs and low FPRs. However, such curves are not common for
ground radar systems and there is a distinct tradeoff that must be made between TPRs
and FPRs.
For GPR systems, the cost of a false positive is not necessarily high for a single
incident, in that if a suspected detection is false, this just means that time and resources
are lost in the investigation and perhaps detonation of a falsely suspected device like
an IED. However, while one single false positive is not necessarily as problematic as it
might be in an unneeded medical procedure that could harm a patient, repeated false
positives in a GPR setting could be extremely dangerous. Investigation of false
positive incidents adds significantly more time to missions, and in the case of IED
detection, ultimately exposes soldiers to potentially harmful enemy fire. Ultimately,
these soldiers could lose trust in such a system, so while the acceptance of some false
positives is permissible, they should be minimized (Cazares 2017).

Figure 6: Old (left) and new (right) interface ROCs with best performer comparison
In order to compare both the relative impact of the display redesign and the use
of collective intelligence, Table 1 summarizes the false positive rates (FPR) for various
individuals and groups for the experiment detailed previously for two given true
positive rate (TPR) thresholds of 80% and 90%. These thresholds are typical for GPR
systems (Chambers et al. 2013). The “Interface FPR Difference” column of Table 1
displays the percent decrease in false positive rates for users of the new interface as
compared to the old interface. Similarly, the row “Best 15 FPR Difference Over
Individuals” in Table 1 displays the percent decrease in false positive rates for the
committee of best 15 performers compared to the best and worst performers, as well as
the consensus of all 30 individuals. It should be noted that the Best and Worst
Individuals were ranked by the total areas under the convex hulls of their ROC curves
for overall best performance, and so may not necessarily have the highest performance
at any discrete point. Similarly, the Group Consensus false positive rates represent the
majority vote of all 30 people combined.
Table 1: False positive rates for individuals and groups at 80% and 90% true positive rates for both interfaces

                      Old 80%  New 80%  Old 90%  New 90%   Interface FPR Difference
                      TPR      TPR      TPR      TPR       80%      90%
Best Individual       32%      25%      66%      35%       22%      47%
Best 15 Committee     35%      24%      51%      32%       31%      37%
Worst Individual      78%      63%      89%      82%       19%      8%
Worst 15 Committee    48%      33%      71%      40%       31%      44%
Group Consensus       38%      25%      60%      37%       34%      38%

Best 15 FPR Difference Over Individuals:
  vs. Best            -9%      4%       23%      9%
  vs. Group           8%       4%       15%      14%
  vs. Worst           55%      62%      43%      61%
Table 1 illustrates that across both individuals and groups of decision makers, at an
80% TPR, the new interface provides a 19-34% reduction in false positives, and an
8-47% reduction at the 90% TPR. Not surprisingly, the new interface provided the smallest
improvement for the worst individual at both TPRs, but the biggest FPR reduction (47%)
occurred for the best individual using the new interface for the 90% TPR. When
considering how much the new interface improved group consensus performance, the new
interface reduced FPRs by 34% at 80% TPR and 38% at 90% TPR.
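The "FPR Difference" entries in Table 1 are relative (percent) reductions rather than percentage-point differences; for example, the best individual's drop from 32% to 25% FPR at 80% TPR is reported as 22%:

```python
def relative_reduction(old_fpr, new_fpr):
    """Percent decrease in false positive rate, as reported in Table 1."""
    return round(100.0 * (old_fpr - new_fpr) / old_fpr)

# Best individual: 32% -> 25% at 80% TPR, 66% -> 35% at 90% TPR
assert relative_reduction(0.32, 0.25) == 22
assert relative_reduction(0.66, 0.35) == 47
# Group consensus at 80% TPR: 38% -> 25%
assert relative_reduction(0.38, 0.25) == 34
```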
The remaining question is how much value was added by using the collective
intelligence approach over various decision makers. In Table 1, using the committee of
the top 15 participants as the basis of comparison, the biggest improvement in the FPRs
for both TPRs is for the worst performer ranging from 33%-63%, depending on the
interface. However, in comparison to the magnitude of reduction for the improved display
design, the improvements made by adding a collective intelligence component were
modest when compared to the best decision makers or even the group as a whole, ranging
from 4% to 23%. Indeed, in one case the collective best 15 group performed worse than
the best performer, increasing the FPR by 9% using the old interface.
Expert users
In addition to the "best" performers for both the old and new interfaces, we were
fortunate to have access to two experts who were key personnel in the design of the
GPR system (beyond the thirty participants in the experiment; they played no role in
the design and conduct of this research study). Figure 7 shows the ROC performance of these experts in
comparison to the committee and the best performers. These experts performed worse
than the 15-Best committee for both interfaces, in addition to performing worse than the
aggregation of all 30 participants (depicted in Figure 7 as the 30-Committee). Generally,
both experts were on par with, or slightly worse than, the 15-Worst committee.

Figure 7: Selected ROC curves for experts and the 15- and 30-person committees
Individually, the experts ranked 5th and 20th out of 32 on the old interface (when
the thirty participants were combined with the two experts) in terms of correct
detections, and 10th and 11th out of 32 on the new interface. This indicates that the
experts were better than many participants, but were still outperformed by many others
who had substantially less experience with the system. This finding suggests two
important points. First, experts who design a system cannot be relied upon to provide
performance benchmarking. In this experiment, people with one hour of experience were
able to significantly outperform experts with hundreds of hours of experience with the
system. Arguably their expertise may have instilled bias in their decision making, but
this result raises the question of whether, for such anomaly search tasks, there is a
point at which experience becomes detrimental to performance.
In addition, while the use of collective intelligence to actively process images may
not be feasible for many agencies in terms of scheduling and personnel requirements
(and Table 1 suggests improved display design carries more weight), this research
suggests that collective evaluation approaches may be useful in the design stages to
illustrate deficiencies and to help stakeholders such as designers better understand
the limitations of their own designs.
Conclusions
Human operators are increasingly employed to determine the presence or absence of
various signals in image search tasks like those in medicine and security. As previously
discussed, recent research has suggested that collective intelligence, i.e., using groups to
vote on the presence of a signal, can improve overall outcomes. However, how such
collective intelligence performance improvements compare to those gained by ensuring
decision support displays embody user-centered design principles has yet to be investigated
in the open literature.
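The collective intelligence scheme described above can be made concrete: each committee member gives a binary call per image, the calls are tallied, and sweeping the number of votes required to declare a target traces out a committee ROC curve. A minimal sketch under those assumptions (the function name and data layout are illustrative, not the study's actual analysis code):

```python
def committee_roc(votes, truth):
    """Sweep the vote threshold of an n-member committee to trace a ROC.

    votes: list of per-member call lists; votes[m][i] = 1 if member m
           called image i a target, else 0.
    truth: ground-truth labels; truth[i] = 1 if image i contains a target.
    Returns one (FPR, TPR) point per vote threshold k = 1..n_members.
    """
    n_members = len(votes)
    n_images = len(truth)
    # How many members called each image a target.
    tally = [sum(votes[m][i] for m in range(n_members)) for i in range(n_images)]
    n_pos = sum(truth)
    n_neg = n_images - n_pos
    points = []
    for k in range(1, n_members + 1):  # decision rule: "at least k votes"
        tp = sum(1 for i in range(n_images) if tally[i] >= k and truth[i] == 1)
        fp = sum(1 for i in range(n_images) if tally[i] >= k and truth[i] == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Low thresholds yield a liberal committee (high TPR but high FPR), high thresholds a conservative one, and simple majority rule corresponds to k = n_members // 2 + 1.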
To this end, a current Ground Penetrating Radar (GPR) display organized temporally
was improved through widely accepted design best practices, including the principle of
the moving part and making critical data relationships visible. Using traditional
within-subjects t-tests, the new spatially organized display was shown to be superior,
with a 4.6% increase in overall correct target identification and a 30.5% decrease in
missed targets. Interestingly, women in this study dominated in terms of performance,
which could be an area for further study. Subjectively, participants preferred the new
interface by an overwhelming majority.
To further complement this analysis and determine the impact of adding a collective
intelligence component to the study, receiver operating characteristic (ROC) curves
were constructed for various groups of people, including the best 15 and worst 15
performers.
Results demonstrated that the committee of the best 15 people generally performed better
on both interfaces than the worst 15 participants and the experts who created the system.
Thus these results are in line with other studies that show there is some advantage to the
use of collective intelligence for such image search tasks (Wolf et al. 2015, Kurvers et al.
2015). However, in an analysis that compared false positive rate changes at two
different true positive rates in the ROCs for both the old and new interfaces, as well
as for group versus individual decision makers, the FPR reduction attributable to the
improved display design exceeded that attributable to collective action by roughly
2:1. This was especially true for top performers.
These results suggest that when trying to improve the performance of humans in signal
detection tasks, the biggest gains could be made by ensuring decision support displays
embody basic design principles (Wickens and Hollands 2000). While the collective
intelligence approach did add value, such a decision support approach could be
difficult to implement, particularly for real-time situations like those faced by some
military personnel. In these applications, it is then critical to make sure the
displays are well designed. Moreover, the results also showed that individual
characteristics could potentially have a significant impact, and that expert designers
should not necessarily be relied upon to set performance standards. While this initial
analysis demonstrated many interesting results, further analysis is warranted.
Acknowledgements
This work is sponsored by the Department of the Army under Air Force Contract #FA8721-05-
C-0002, and in collaboration with MIT Lincoln Laboratory. We extend special thanks to the
reviewers who were very patient and thorough.
Key Points
• The spatial display was quantitatively and qualitatively superior to the temporal
display. The new display resulted in a 4.6% increase in correctness of the
participants’ classifications and a 30.5% decrease in miss percentage.
• Qualitatively, 83% of participants preferred the new display.
• Using a collective intelligence approach, a committee of the best 15 people
outperformed most individuals and all of the experts.
• When comparing the relative contribution of display design improvements to that of
the collective intelligence approach, the performance gains due to the improved design
outweighed the value of collective action by roughly 2:1.
References
Cazares, Shelley M. 2017. "The Threat Detection System That Cried Wolf: Reconciling Developers with Operators." Defense Acquisition Research Journal 80:42-64.
Chambers, D. H., D. W. Paglieroni, J. E. Mast, and N. R. Bee. 2013. Real-Time Vehicle-Mounted Multistatic Ground Penetrating Radar Imaging System for Buried Object Detection. Livermore, CA: Lawrence Livermore National Security, LLC.
Ehrlinger, J., and D. Dunning. 2003. "How chronic self-views influence (and potentially mislead) estimates of performance." Journal of Personality and Social Psychology 84 (1):5-17.
Florkowski, C. M. 2008. "Sensitivity, Specificity, Receiver-Operating Characteristic (ROC) Curves and Likelihood Ratios: Communicating the Performance of Diagnostic Tests." Clinical Biochemist Reviews 29 (1):83-87.
Harrower, M., and C. A. Brewer. 2003. "ColorBrewer.org: An Online Tool for Selecting Colour Schemes for Maps." The Cartographic Journal 40 (1):27-37.
Hunt, L. D., D. Massie, and J. P. Cull. 2000. "Standard palettes for GPR data analysis." Eighth International Conference on Ground Penetrating Radar, Gold Coast, Australia.
Kurvers, R. H. J. M., J. Krause, G. Argenziano, I. Zalaudek, and M. Wolf. 2015. "Detection Accuracy of Collective Intelligence Assessments for Skin Cancer Diagnosis." JAMA Dermatology 151 (12):1346-1353.
Nickerson, Raymond S., and Charles C. McGoldrick. 1963. "Confidence, Correctness, and Difficulty with Non-Psychophysical Comparative Judgments." Perceptual and Motor Skills 17 (1):159-167.
Nielsen, J. 1993. Usability Engineering. Cambridge, MA: Academic Press.
Provost, Foster, and Tom Fawcett. 2001. "Robust classification for imprecise environments." Machine Learning Journal 42 (3):203-231.
Quimby, P. W. 2013. "A Spatial Display for Ground-Penetrating Radar Change Detection." Master's thesis, Aeronautics and Astronautics, Massachusetts Institute of Technology.
Roscoe, S. N. 1968. "Airborne displays for flight and navigation." Human Factors 10:321-332.
Staggers, N., and A. F. Norcio. 1993. "Mental models: concepts for human-computer interaction research." International Journal of Man-Machine Studies 38:587-605.
Stanley, Byron. 2016. "Localizing Ground Penetrating Radar." Tech Notes, June.
Surowiecki, James. 2004. The Wisdom of Crowds. New York: Anchor Books.
Wickens, Chris D., and Justin G. Hollands. 2000. Engineering Psychology and Human Performance. 3rd ed. Upper Saddle River, NJ: Prentice Hall.
Wolf, M., J. Krause, P. A. Carney, A. Bogart, and R. H. J. M. Kurvers. 2015. "Collective Intelligence Meets Medical Decision-Making: The Collective Outperforms the Best Radiologist." PLoS ONE 10 (8): e0134269.
Xing, J. 2006. Color and Visual Factors in ATC Displays. Washington, DC: FAA Civil Aerospace Medical Institute.