For Peer Review Only
The Power of Collective Intelligence in a Signal Detection Task
Mary L. Cummings* Humans and Autonomy Laboratory
Duke University Durham, NC, USA
Paul W. Quimby Massachusetts Institute of Technology
[email protected] Cambridge, MA, USA
* Corresponding Author
Page 1 of 21
Mary L. Cummings, PhD: Professor Cummings received her Ph.D. in systems engineering from the University of Virginia in 2004. She is currently a Professor in the Duke University Department of Mechanical Engineering and Materials Science, the Duke Institute of Brain Sciences, and the Duke Electrical and Computer Engineering Department. She is the director of the Humans and Autonomy Laboratory and Duke Robotics.
Paul W. Quimby received his Master of Engineering in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2013. His current research interests include remotely piloted and autonomous aircraft, autonomous system management, and industrial human factors.
URL: http://mc.manuscriptcentral.com/ttie
Theoretical Issues in Ergonomics Science
Abstract
A novel spatial display for a classic signal detection problem was tested in an
experiment with thirty naïve volunteers and two experts, measuring the signal
detection properties of a spatially organized display compared with those of a temporally
organized display. The collective performance of committees was also measured.
The spatial display was quantitatively and qualitatively superior, resulting in a
4.6% increase in correctness of classifications and a 30.5% decrease in miss
percentages. Qualitatively, 83% of participants preferred the new display. The
collective receiver operating characteristic (ROC) performance of the best-performing
half of the naive participants was superior to that of most individuals and all
experts. However, an analysis that examined the value added by both the
improved spatial display and the collective intelligence approach
demonstrated that the bulk of the performance gains, by roughly 2:1, can be
attributed to the improved design rather than to collective action.
Relevance to human factors/ergonomics theory: This study compares a traditional temporal
display and a spatial display for a human attempting classification in a classic signal detection
setting. Collective intelligence improved overall performance, but following established best
practices for decision support displays provided bigger performance gains.
Keywords: signal detection, display design, collective intelligence, ground penetrating radar,
receiver operating characteristic
Introduction

With advances in radar technology, signal processing, and artificial intelligence,
human operators are increasingly employed to determine the presence or absence of
various signals, including baggage and personnel screening and optical inspection such as
systems that scan underneath cars for explosive devices. Because of the uncertainty in
such systems and the lack of clear signal thresholds, the detection of objects with
automated technologies has not advanced to the point where human judgment is no longer
needed. Therefore, investigating the ability of human operators to distinguish objects from
noise is important to the advancement of object detection systems. In addition, modeling
human behavior in this domain provides insights into how automation could assist a
human, and vice-versa. Moreover, understanding the currently achievable levels of
performance and the relative merits of both human and automation elements enables
engineers to construct the most functional systems possible.
The best possible signal detection system in these environments not only effectively
differentiates signals from noise but also provides operators with efficient and reliable
means of finding objects of interest. One key aspect of such a system is the human
interface design, i.e., how well the system assists an operator in object detection.
Typically such systems are designed so that a single operator evaluates a solution, e.g., a
single baggage screener looks at a screen to determine whether an illegal object is in an
image. However, recent research suggests that one form of "collective intelligence,"
known as the wisdom of crowds, can be superior to an individual's decision: pooling
independent answers to a question from multiple people can yield a better solution than
that proposed by a few people (Surowiecki 2004). Unfortunately, there is little to no
research that examines the value added by taking such a collective intelligence approach
to a real world problem, which could have significant personnel and scheduling
implications as compared to better designing displays for individuals.
To this end, in order to investigate both the role of improved decision support and the
potential added value of collective intelligence in an image detection task, a test bed was
used that leveraged an experimental vehicle-mounted ground penetrating radar (GPR)
simulation developed by MIT Lincoln Laboratory (Stanley 2016). This system consisted
of an electronically controlled but mechanically actuated stock vehicle, a GPR array, a
computer system, and the operator display (described in the next section). The
vehicle traveled a previously traversed route and continuously compared the incoming
GPR data to previously recorded GPR data. The goal of this system was to allow an
operator to find buried objects via a change detection paradigm. The chief question
operators must answer is whether they think the display depicts an object or a false alarm.
Such technologies are critical for saving lives in the detection of Improvised Explosive
Devices (IEDs), a current real world application of GPRs.
Improving the design
Figure 1 depicts the original GPR operator display. The radar data is captured and
analyzed by various algorithms using the location of each sensor at the time the data is
recorded. In the original interface, this sensor data is displayed to the user via colored
rectangles, organized by time (on the horizontal axis) and radar elements (on the vertical
axis, Figure 1). A fixed number of sensor readings is displayed at all times (Figure 1(iii)),
with older values being removed from the left edge to make room for new readings added
to the right edge (Figure 1(i)), simulating the forward motion of the vehicle.
Figure 1: Original GPR display. (i) Most recent data appears; (ii) region of no data; (iii) fixed number of radar elements; (iv) fixed number of time steps.
Because the GPR data is captured on a polling timer, the graphic
representations of objects are distorted on the display depending on how quickly the
sensor is traversing the ground. Objects are also distorted by the turning of the vehicle,
which decreases the temporal spacing between sensor readings on the inside radius of the
turn compared to the outside of the turn. No measurements are taken if the forward
movement is very small, if the vehicle is stationary, or if the vehicle is moving backward.
Operators of the original fielded GPR system found the display difficult to interpret,
particularly given the distortion and the representation of the front of the vehicle on the
right side of the display, which was constantly updating. To address these design
shortcomings, we proposed a redesign of the user interface (Figure 2).
The old interface (Figure 1) displays the newest data on the right edge of the display
(Figure 1(i)). This is called a "track-right" display and presents the viewer with the
experience of traveling to the right. This presentation is at odds with the fact
that the vehicle is moving forward, corresponding to the top edge of the display.

Figure 2: Proposed redesigned GPR display. (i) Direction of forward motion; (ii) no data; (iii) detection alert; (iv) previous detection alert; (v) reference traversal; (vi) radar array outline; (vii) tire outline; (viii) 1 m grid.

In
keeping with the principle of the moving part (Wickens and Hollands 2000; Roscoe 1968),
the new interface uses a "track-up" display with the newest data at the top (Figure 2(i)).
This approach allows new data to populate along the top edge of the display, indicative of
vehicle forward motion, rather than along the right edge of the image. To further
improve situation awareness, the physical location of the radar array on the front of the
vehicle was added (Figure 2(vi)).
In order to help operators better understand where a signal is located geographically
relative to the sensors, the new display renders the data as points using nearest-neighbor
interpolation to avoid removing key information. This should improve operators'
perception of where a signal is located, and thus their perception-based situation
awareness (Endsley 2000). Areas of no data show only a gray background
(Figure 2(ii)), as opposed to the blue background of Figure 1(ii).
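The point-based rendering described above can be sketched as a nearest-neighbor rasterization; this is an illustrative approximation (the sample format, grid size, and distance cutoff are our assumptions, not the fielded implementation):

```python
import numpy as np

def render_nearest_neighbor(samples, grid_w, grid_h, cell=0.1):
    """Rasterize scattered GPR samples onto a top-down grid.

    samples: list of (x, y, value) triples in metres, ground frame.
    Cells with no nearby sample stay NaN, corresponding to the gray
    'no data' background of the redesigned display.
    """
    grid = np.full((grid_h, grid_w), np.nan)
    if len(samples) == 0:
        return grid
    xs = np.array([s[0] for s in samples])
    ys = np.array([s[1] for s in samples])
    vs = np.array([s[2] for s in samples])
    for row in range(grid_h):
        for col in range(grid_w):
            # centre of this display cell in ground coordinates
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
            d2 = (xs - cx) ** 2 + (ys - cy) ** 2
            i = int(np.argmin(d2))
            # fill only if the nearest sample is within one cell width,
            # so sparse regions remain 'no data' rather than smeared
            if d2[i] <= cell ** 2:
                grid[row, col] = vs[i]
    return grid
```

Rendering each reading at its true ground position, rather than one column per time step, is what removes the speed- and turn-dependent distortion of the original display.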
Detection indicators were added to the new display to help the operator identify which
data to examine, reducing working memory requirements (Nielsen 1993). Large triangles
indicate current detection events (Figure 2(iii)), while small triangles indicate previous
detection events (Figure 2(iv)). These indicators help operators maintain a temporal
understanding of what has previously been analyzed and what is new.
Other improvements were made to reduce the time needed to build situation
awareness and an effective mental model (Staggers and Norcio 1993) by making
data relationships visible. These included adding a black line on each side of the path of the
vehicle's sensors to facilitate understanding of the current versus the reference
traversal (Figure 2(v)). In addition, the radar array and the two front tires (Figure 2(vi &
vii), respectively) were made visible, and a faint grid overlay was added to provide a scale
(Figure 2(viii)).
The design was revamped to reduce the cognitive complexity of the display (Xing
2006). The original colormap for GPR was a full rainbow, chosen because it was thought to
have superior contrast to standard black and white colormaps and is standard in GPR
systems (Hunt, Massie, and Cull 2000). The new interface in Figure 2 uses a simpler
colormap based on three colors: a blue, a light gray, and a red. The blue is the lowest radar
return value shown and the red is the highest radar return value shown. A dark gray is used
to signify a lack of data (Figure 2(ii)). Not only does this colormap reduce the cognitive
complexity associated with color interpretation, but it also fulfills the requirement of
avoiding combinations of colors confused by individuals with deuteranopia and tritanopia
color blindness (Harrower and Brewer 2003).
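The simplified colormap can be sketched as a two-segment linear ramp; the RGB anchor values here are illustrative assumptions, not the exact colors of the redesigned display:

```python
import numpy as np

# Illustrative anchors: blue (lowest return), light gray (middle),
# red (highest return). A separate dark gray marks 'no data'.
ANCHORS = np.array([[0.20, 0.30, 0.80],   # blue
                    [0.85, 0.85, 0.85],   # light gray
                    [0.80, 0.15, 0.10]])  # red

def colorize(values):
    """Map normalized radar returns in [0, 1] to RGB via the 3-color ramp."""
    v = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    seg = (v >= 0.5).astype(int)   # 0: blue -> gray, 1: gray -> red
    t = (v - 0.5 * seg) / 0.5      # fractional position within the segment
    lo, hi = ANCHORS[seg], ANCHORS[seg + 1]
    return lo + t[..., None] * (hi - lo)
```

A two-hue ramp through a neutral midpoint avoids the false perceptual boundaries a full rainbow colormap introduces, while keeping the two endpoints clearly distinguishable.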
The new design of the GPR display was based on established human-centered
design principles such as the principle of the moving part, reduced cognitive
complexity, design for recognition instead of recall, and making critical data relationships
visible. Testing should reveal whether the design was improved, as well as how these
design changes affected performance, both individually and collectively.
Method
Participants
Thirty-four participants from a large technology research and development
company were recruited over email and through word of mouth. Three participants were
considered experts because of prior affiliation with the project: they were key members
in the design of both the system and the operator display. One non-expert participant and
one expert were disqualified due to colorblindness, yielding a pool of thirty experimental
participants and two experts. Of the thirty participants, twenty-one were male and nine
were female. Participants ranged in age from twenty-two to seventy-two years (mean = 41,
SD (standard deviation) = 13). All participants had self-reported corrected vision of 20/25
or better.
Simulation
While the fielded GPR system used a commercial flat screen display mounted on the
dashboard with resistive touchscreen capabilities, for convenience and cost savings, this
experiment was conducted using an Apple iPad® lying flat on a tabletop. The iPad had
the same pixel density (52 pixels/cm) as the fielded display, with similar touchscreen
capabilities. The results from this study are not expected to be directly transferable to
the fielded system; rather, they are meant to examine the relative differences given both
the specific display improvements and the use of a collective approach.
Experiment Design
This study was a within-subjects, repeated-measures design. Forty-eight of the 102
images were noise and 54 contained a signal. Scenario order was counterbalanced and
randomized for every participant and between the two interfaces, and interface order
(Old/New first) was counterbalanced across participants. Participants completed 102
scenarios on one interface, then 102 scenarios on the other.
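The counterbalancing and randomization described above can be sketched as follows; the seeding scheme and alternating assignment are illustrative assumptions, not the study's actual protocol code:

```python
import random

def assign_session(participant_idx, scenario_ids, seed=0):
    """Counterbalance interface order across participants and shuffle
    scenario order independently for each interface."""
    rng = random.Random(seed + participant_idx)
    # alternate which interface comes first, participant by participant
    order = ["old", "new"] if participant_idx % 2 == 0 else ["new", "old"]
    plan = []
    for interface in order:
        scenarios = list(scenario_ids)
        rng.shuffle(scenarios)   # fresh random order for each interface
        plan.append((interface, scenarios))
    return plan
```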
Procedure
Each participant was briefed on the interface and its symbology, and shown ten scenarios
with the correct answers. They then practiced on their own, with these same ten scenarios
presented in a different order. For each scenario presented, participants could press one of
two buttons labeled “Object” or “False Alarm”. After submitting each answer in the
training sequence, participants were shown the correct answer. Figure 3 shows example
representations of Noise and Objects as indicated on the Old vs. New displays. Figure 3b
illustrates the selection button.
At the end of the ten training scenarios, a summary screen displayed the results and the
experimenter talked through the incorrect answers with the participant and answered any
questions. If the participant incorrectly answered three or more scenarios, a remedial
training protocol was given which included an additional six scenarios. Three participants
required the remedial training protocol for one of the two interfaces, and one participant
required the remedial training protocol for both interfaces. Participants were informed that
their performance would only be assessed by the correctness of their answers, and not by
the time they took to complete the exercise. Participants were also informed of a $100 gift
certificate for the best performance.
After training, the test scenarios began. Unlike in the training session, participants
were not allowed to change their answers, did not receive feedback on their
performance, and were not shown the correct answers for each scenario. After each
answer, a slider was displayed and the participant was asked to rate his or her confidence
in the given answer on a seven-point scale ranging from "not at all confident" to "very
confident." Every ten scenarios, a slide was shown that suggested the participant could take
a break if desired.

Figure 3: Example scenarios on the old and new interfaces. Top: an object present on the old interface (left) and on the new interface (right). Bottom: an object not present (noise) on the old interface (left) and on the new interface (right), with the "Object or False Alarm" selection button presented for each image.
Results

Given the within-subjects design of this experiment, paired t-tests were used for
data that met underlying normality assumptions. When nonparametric measures were
required, the Wilcoxon signed-rank test was used to evaluate paired samples and the
Mann-Whitney U test to evaluate independent samples; alpha = .05 throughout.
Participants correctly categorized more GPR scenarios using the new interface,
t(29) = 2.66, p = .013. The new interface had a slightly higher average percentage
correct compared to the old (Mold = 66.8%, SDold = 4.4%; Mnew = 69.9%, SDnew = 4.8%),
a 4.6% relative improvement (Figure 4a).
Participants had significantly fewer misses on the new interface, z = 3.77, p < .001 (Mold
= 19.0%, SDold = 6.0%; Mnew = 13.2%, SDnew = 6.1%) (Figure 4b). A miss occurs when
an object is present but the participant selects the "False Alarm" button.
This was a 30.5% relative improvement over the old interface.
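Note that the 4.6% and 30.5% figures are relative changes with respect to the old interface, not percentage-point differences:

```python
def relative_change(old, new):
    """Relative improvement as reported in the text (not percentage points)."""
    return 100.0 * (new - old) / old

# Correctness: 66.8% -> 69.9% is a ~4.6% relative increase
assert round(relative_change(66.8, 69.9), 1) == 4.6
# Miss rate: 19.0% -> 13.2% is a ~30.5% relative decrease
assert round(relative_change(19.0, 13.2), 1) == -30.5
```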
Figure 4: (a) Correct (left) and (b) miss (right) ratios using the old and new interfaces with 95% confidence intervals
Insufficient evidence was found to suggest that participants were more confident in
their answers on either interface, t(29) = 1.92, p = .065, although the trend favored
the new interface (Mold = 4.62, SDold = .64; Mnew = 4.71, SDnew = .70).
However, this metric does not capture the confidence of participants as a
function of their performance. To this end, a metric was calculated to assess the
appropriateness of participants’ confidence in their answers. This metric was
calculated as follows:
Confidence correctness = Σxici / Σci
xi ≡ 1 if the participant answered scenario i correctly, or -1 otherwise
ci ≡ the participant’s confidence [1 − 7] in their answer for scenario i
The confidence-correctness score yields higher values for confidently correct answers.
This straightforward computation weights each answer's correctness by the participant's
confidence, and is needed because previous research indicates that subjective confidence
ratings often do not align with task difficulty (Nickerson and McGoldrick 1963).
Participants' confidence-correctness scores statistically favored the new interface,
t(29) = 3.75, p = .001 (Mold = .38, SDold = .09; Mnew = .47, SDnew = .11).
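A direct implementation of the confidence-correctness score defined above (variable names are ours, and the sample data is hypothetical):

```python
def confidence_correctness(correct, confidence):
    """Sum of (+/-1 correctness x confidence) over the sum of confidence.

    correct:    booleans, one per scenario (True = answered correctly).
    confidence: matching ratings on the 1-7 scale.
    Returns a value in [-1, 1]; confident wrong answers are penalized
    more heavily than unconfident ones.
    """
    x = [1 if ok else -1 for ok in correct]
    num = sum(xi * ci for xi, ci in zip(x, confidence))
    return num / sum(confidence)
```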
A notable demographic result is that women performed better than men in
many respects, but only on the new interface. For the improved interface, women
answered more scenarios correctly, t(29) = 3.20, p = .003 (Mmale = 68.1% SDmale = 4.4%,
Mfemale = 73.6% SDfemale = 3.7%), missed fewer objects, Mann Whitney U = 44.5, p =
.022 (Mdmale = 17.8%, Mdfemale = 8.3%), had a higher confidence correctness score,
t(29) = 2.40, p = .023 (Mmale = .44 SDmale = .10, Mfemale = .53 SDfemale = .08), and had a
higher average confidence, t(29)=2.65, p = .013 (Mmale = 4.56 SDmale = .66, Mfemale =
5.13 SDfemale = .64). These results run contrary to other findings that report women
tend to undervalue their own performance in similar situations (Ehrlinger and Dunning
2003).
In terms of subjective preference, participants overwhelmingly preferred the new
interface. When asked to assess their preference on a five-value Likert scale between the
interfaces, 83% preferred or strongly preferred the new interface, while 17% expressed
no preference or preferred the old interface.
The collective intelligence comparison
While improving the design of a decision support display through well-established
human-centered techniques was expected to produce better performance for
individuals, it is less well known how such an improvement affects collective
decision making. Moreover, while some research suggests that collective decision
making can be superior to individual responses in such image anomaly search
tasks, very little research has examined the relative value added of decision
support improvements versus the use of collective intelligence.
For example, a study examining the relative performance between individuals and
groups in a skin cancer detection task has shown value in collective analysis (Kurvers
et al. 2015), as did another study looking at the power of group decision making in
mammography analysis (Wolf et al. 2015). While interesting and potentially useful in
these specific domains, these studies did not assess how a poorly designed
interface could contribute to subpar individual performance. This study therefore
compares the improvements in both individual and group decision making over the old
interface, in order to examine the relative value added of decision support design
changes versus leveraging collective intelligence.
To this end, Receiver Operating Characteristic (ROC) curves were
obtained for various "committees" of people who effectively "voted" on the presence
or absence of a target. These votes were taken from the experimental results
described in the previous section and were combined to see how the performances of
various groups compared to individual performances. For ranking purposes, performers
were ordered by the total area under the convex hull of their ROC curve to determine their
membership in committees that ranged in size from one (the best performer) to thirty
(all participants). Intermediate committee sizes included 3, 5, and 15 people.
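One plausible way to combine such votes is a simple majority rule; the paper does not spell out its exact aggregation or tie-breaking, so the sketch below is an assumption for illustration:

```python
def committee_decision(votes):
    """Majority vote over individual 'Object'/'False Alarm' calls.

    votes: booleans, one per committee member (True = 'Object').
    Ties are broken toward 'Object' here -- an assumption, reflecting
    that a miss is costlier than a false alarm in IED detection.
    """
    return 2 * sum(votes) >= len(votes)
```

Sweeping the number of "Object" votes required, from one member up to the full committee, traces out a committee's ROC curve.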
Given that sampling errors decrease with the square root of the number of samples, we
elected to use the largest committees possible to determine the value added of collective
intelligence versus improved display design. This resulted in choosing two committees:
the 15 best performers and the 15 worst. The aggregated best and worst performances of these
participants were compared to individual performances to understand the range of likely
outcomes. Further details about additional formulations of various committees and their
performances, particularly in comparing the best and worst performers, can be found
elsewhere (Quimby 2013).
The ROC curves were computed from each participant’s True Positive Rate (TPR)
and False Positive Rate (FPR) for both interfaces (Figure 5). To determine the top and
bottom committee membership, performers were ordered by the total area under the
convex hull of their ROC curve, which is a technique that creates iso-performance
lines for classifiers, i.e., the participants (Provost and Fawcett 2001). The thirty
participants' ROC curves are represented by the faint blue and red lines in Figure 5, with
the two committee performances for the old and new interfaces shown in the bold and
dashed lines.
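Ranking by total area under the ROC convex hull can be sketched as follows; the monotone-chain hull construction here is an illustrative reimplementation, not the study's analysis code:

```python
def roc_convex_hull_area(points):
    """Area under the upper convex hull of a set of ROC points.

    points: iterable of (FPR, TPR) pairs; (0, 0) and (1, 1) are added
    automatically. A larger area indicates a better-ranked classifier
    (here, a participant).
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # pop the last hull point while it lies on or below the chord
        # from hull[-2] to p (keeps only the concave upper hull)
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) >= (y2 - y1) * (p[0] - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    # trapezoidal area under the hull segments
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(hull, hull[1:]))
```

A chance-level participant hulls to the diagonal (area 0.5); operating points below the hull are dominated and do not affect the ranking.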
As shown in Figure 5, the interface mattered, with the new interface consistently
outperforming the old interface when TPR values were greater than 60%. As expected,
in all cases the better committees using both interfaces outperformed the worse
committees. In general for signal detection tasks, the goal is to have high TPRs but low
FPRs, making the upper left region the most desirable.
While Figure 5 demonstrates that the committee formed from the best people
performed better using the new interface, it does not answer how the committee
performed when compared to the best performer, which is shown in Figure 6.

Figure 5: ROC curves for the best and worst committees of 15 with the old and new interfaces

The
performance of the 15-person best committee (labeled in Figure 6 as 15-Best
(Old/New)) appears to be better than or equivalent to that of the best performer on the new
interface (labeled in Figure 6 as 1-Best (Old/New)), while the best performer under the
old interface was better than the committee between 55% and 80% TPR.
Given the nature of ROC curves, to be able to more effectively compare the
relative performance impacts of both the interface design as well as the value of
collective intelligence, an anchor metric must be selected. As discussed previously,
ideally the ROC curves would be in the upper left corner, and indeed, for very mature
tests, such as a peptide test to diagnose heart failure (Florkowski 2008), it is routine to
have both high TPRs and low FPRs. However, such curves are not common for
ground radar systems and there is a distinct tradeoff that must be made between TPRs
and FPRs.
For GPR systems, the cost of a false positive is not necessarily high for a single
incident, in that if a suspected detection is false, this just means that time and resources
are lost in the investigation and perhaps detonation of a falsely suspected device like
an IED. However, while one single false positive is not necessarily as problematic as it
might be in an unneeded medical procedure that could harm a patient, repeated false
positives in a GPR setting could be extremely dangerous. Investigation of false
positive incidents adds significantly more time to missions, and in the case of IED
detection, ultimately exposes soldiers to potentially harmful enemy fire. Ultimately,
these soldiers could lose trust in such a system, so while the acceptance of some false
positives is permissible, they should be minimized (Cazares 2017).

Figure 6: Old (left) and new (right) interface ROCs with best performer comparison
In order to compare both the relative impact of the display redesign and the use
of collective intelligence, Table 1 summarizes the false positive rates (FPR) for various
individuals and groups for the experiment detailed previously for two given true
positive rate (TPR) thresholds of 80% and 90%. These thresholds are typical for GPR
systems (Chambers et al. 2013). The “Interface FPR Difference” column of Table 1
displays the percent decrease in false positive rates for users of the new interface as
compared to the old interface. Similarly, the row “Best 15 FPR Difference Over
Individuals” in Table 1 displays the percent decrease in false positive rates for the
committee of best 15 performers compared to the best and worst performers, as well as
the consensus of all 30 individuals. It should be noted that the Best and Worst
Individuals were ranked by the total areas under the convex hulls of their ROC curves
for overall best performance, and so may not necessarily have the highest performance
at any discrete point. Similarly, the Group Consensus false positive rates represent the
majority vote of all 30 people combined.
Table 1: False positive rates for individuals and groups at 80% and 90% true positive rates for both interfaces

                      Old 80%  New 80%  Old 90%  New 90%   Interface FPR Difference
                      TPR      TPR      TPR      TPR       80%      90%
Best Individual       32%      25%      66%      35%       22%      47%
Best 15 Committee     35%      24%      51%      32%       31%      37%
Worst Individual      78%      63%      89%      82%       19%      8%
Worst 15 Committee    48%      33%      71%      40%       31%      44%
Group Consensus       38%      25%      60%      37%       34%      38%

Best 15 FPR Difference Over Individuals:
  vs. Best            -9%      4%       23%      9%
  vs. Group           8%       4%       15%      14%
  vs. Worst           55%      62%      43%      61%
Table 1 illustrates that across both individuals and groups of decision makers, at an
80% TPR, the new interface provides a 19-34% reduction in false positives, and an
8-47% reduction at the 90% TPR. Not surprisingly, the new interface provided the smallest
improvement for the worst individual at both TPRs, but the biggest FPR reduction (47%)
occurred for the best individual using the new interface for the 90% TPR. When
considering how much the new interface improved group consensus performance, the new
interface reduced FPRs by 34% at 80% TPR and 38% at 90% TPR.
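The "FPR Difference" entries in Table 1 are relative (percent) reductions rather than percentage-point differences; for example, the best individual's drop from 32% to 25% FPR at 80% TPR is reported as 22%:

```python
def relative_reduction(old_fpr, new_fpr):
    """Percent decrease in false positive rate, as reported in Table 1."""
    return round(100.0 * (old_fpr - new_fpr) / old_fpr)

# Best individual: 32% -> 25% at 80% TPR, 66% -> 35% at 90% TPR
assert relative_reduction(0.32, 0.25) == 22
assert relative_reduction(0.66, 0.35) == 47
# Group consensus at 80% TPR: 38% -> 25%
assert relative_reduction(0.38, 0.25) == 34
```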
The remaining question is how much value was added by using the collective
intelligence approach over various decision makers. In Table 1, using the committee of
the top 15 participants as the basis of comparison, the biggest improvement in the FPRs
for both TPRs is for the worst performer ranging from 33%-63%, depending on the
interface. However, in comparison to the magnitude of reduction for the improved display
design, the improvements made by adding a collective intelligence component were
modest when compared to the best decision makers or even the group as a whole, ranging
from 4% to 23%. Indeed, in one case the collective best 15 group performed worse than
the best performer, increasing the FPR by 9% using the old interface.
Expert users
In addition to the "best" performers for both the old and new interfaces, we were
fortunate to have access to two experts who were key personnel in the design of the
GPR system (beyond the thirty participants in the experiment; they played no role in
the design and conduct of this research study). Figure 7 shows the ROC performance of these experts in
comparison to the committee and the best performers. These experts performed worse
than the 15-Best committee for both interfaces, in addition to performing worse than the
aggregation of all 30 participants (depicted in Figure 7 as the 30-Committee). Generally,
both experts were on par with, or slightly worse than, the 15-Worst committee.

Figure 7: Selected ROC curves for experts and the 15- and 30-person committees
Individually, the experts ranked 5th and 20th out of 32 on the old interface (when
the thirty participants were combined with the two experts) in terms of correct
detections, and 10th and 11th out of 32 on the new interface. This indicates that the
experts were better than many participants, but were still outperformed by many others
who had substantially less experience with the system. This finding suggests two
important points. First, experts who design a system cannot be relied upon to provide
performance benchmarking. In this experiment, people with one hour of experience were
able to significantly outperform experts with hundreds of hours of experience with the
system. Arguably their expertise may have instilled bias in their decision making, but
this result raises the question of whether, for such anomaly search tasks, there is a
point at which experience becomes detrimental to performance.
In addition, while the use of collective intelligence to actively process images may
not be feasible for many agencies in terms of scheduling and personnel requirements
(and Table 1 suggests improved display design carries more weight), this research
suggests that collective evaluation approaches may be useful in the design stages to
illustrate deficiencies and to help stakeholders such as designers better understand
the limitations of their own designs.
Conclusions
Human operators are increasingly employed to determine the presence or absence of
various signals in image search tasks like those in medicine and security. As previously
discussed, recent research has suggested that collective intelligence, i.e., using groups to
vote on the presence of a signal, can improve overall outcomes. However, how such
collective intelligence performance improvements compare to those gained by ensuring
decision support displays embody user-centered design principles has yet to be investigated
in the open literature.
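The collective intelligence scheme described above can be made concrete: each committee member gives a binary call per image, the calls are tallied, and sweeping the number of votes required to declare a target traces out a committee ROC curve. A minimal sketch under those assumptions (the function name and data layout are illustrative, not the study's actual analysis code):

```python
def committee_roc(votes, truth):
    """Sweep the vote threshold of an n-member committee to trace a ROC.

    votes: list of per-member call lists; votes[m][i] = 1 if member m
           called image i a target, else 0.
    truth: ground-truth labels; truth[i] = 1 if image i contains a target.
    Returns one (FPR, TPR) point per vote threshold k = 1..n_members.
    """
    n_members = len(votes)
    n_images = len(truth)
    # How many members called each image a target.
    tally = [sum(votes[m][i] for m in range(n_members)) for i in range(n_images)]
    n_pos = sum(truth)
    n_neg = n_images - n_pos
    points = []
    for k in range(1, n_members + 1):  # decision rule: "at least k votes"
        tp = sum(1 for i in range(n_images) if tally[i] >= k and truth[i] == 1)
        fp = sum(1 for i in range(n_images) if tally[i] >= k and truth[i] == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Low thresholds yield a liberal committee (high TPR but high FPR), high thresholds a conservative one, and simple majority rule corresponds to k = n_members // 2 + 1.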
To this end, a current Ground Penetrating Radar (GPR) display organized temporally
was improved through widely accepted design best practices, including the principle of
the moving part and making critical data relationships visible. Using traditional
within-subjects t-tests, the new spatially organized display was shown to be superior,
with a 4.6% increase in overall correct target identification and a 30.5% decrease in
missed targets. Interestingly, women in this study dominated in terms of performance,
which could be an area for further study. Subjectively, participants preferred the new
interface by an overwhelming majority.
To further complement this analysis and determine the impact of adding a collective
intelligence component to the study, receiver operating characteristic (ROC) curves
were constructed for various groups of people, including the best 15 and worst 15
performers.
Results demonstrated that the committee of the best 15 people generally performed better
on both interfaces than the worst 15 participants and the experts who created the system.
Thus these results are in line with other studies that show there is some advantage to the
use of collective intelligence for such image search tasks (Wolf et al. 2015, Kurvers et al.
2015). However, in an analysis that compared false positive rate changes at two
different true positive rates in the ROCs for both the old and new interfaces, as well
as for group versus individual decision makers, the FPR reduction attributable to the
improved display design exceeded that attributable to collective action by roughly
2:1. This was especially true for top performers.
These results suggest that when trying to improve the performance of humans in signal
detection tasks, the biggest gains could be made by ensuring decision support displays
embody basic design principles (Wickens and Hollands 2000). While the collective
intelligence approach did add value, such a decision support approach could be
difficult to implement, particularly for real-time situations like those faced by some
military personnel. In these applications, it is then critical to make sure the
displays are well designed. Moreover, the results also showed that individual
characteristics could potentially have a significant impact, and that expert designers
should not necessarily be relied upon to set performance standards. While this initial
analysis demonstrated many interesting results, further analysis is warranted.
Acknowledgements
This work is sponsored by the Department of the Army under Air Force Contract #FA8721-05-
C-0002, and in collaboration with MIT Lincoln Laboratory. We extend special thanks to the
reviewers who were very patient and thorough.
Key Points
• The spatial display was quantitatively and qualitatively superior to the temporal
display. The new display resulted in a 4.6% increase in correctness of the
participants’ classifications and a 30.5% decrease in miss percentage.
• Qualitatively, 83% of participants preferred the new display.
• Using a collective intelligence approach, a committee of the best 15 people
outperformed most individuals and all of the experts.
• When comparing the relative contribution of display design improvements to that of
the collective intelligence approach, the performance gains due to the improved design
outweighed the value of collective action by roughly 2:1.
References
Cazares, Shelley M. 2017. "The Threat Detection System That Cried Wolf: Reconciling Developers with Operators." Defense Acquisition Research Journal 80:42-64.
Chambers, D. H., D. W. Paglieroni, J. E. Mast, and N. R. Bee. 2013. Real-Time Vehicle-Mounted Multistatic Ground Penetrating Radar Imaging System for Buried Object Detection. Livermore, CA: Lawrence Livermore National Security, LLC.
Ehrlinger, J., and D. Dunning. 2003. "How chronic self-views influence (and potentially mislead) estimates of performance." Journal of Personality and Social Psychology 84 (1):5-17.
Florkowski, C. M. 2008. "Sensitivity, Specificity, Receiver-Operating Characteristic (ROC) Curves and Likelihood Ratios: Communicating the Performance of Diagnostic Tests." Clinical Biochemist Reviews 29 (1):83-87.
Harrower, M., and C. A. Brewer. 2003. "ColorBrewer.org: An Online Tool for Selecting Colour Schemes for Maps." The Cartographic Journal 40 (1):27-37.
Hunt, L. D., D. Massie, and J. P. Cull. 2000. "Standard palettes for GPR data analysis." Eighth International Conference on Ground Penetrating Radar, Gold Coast, Australia.
Kurvers, R. H. J. M., J. Krause, G. Argenziano, I. Zalaudek, and M. Wolf. 2015. "Detection Accuracy of Collective Intelligence Assessments for Skin Cancer Diagnosis." JAMA Dermatology 151 (12):1346-1353.
Nickerson, Raymond S., and Charles C. McGoldrick. 1963. "Confidence, Correctness, and Difficulty with Non-Psychophysical Comparative Judgments." Perceptual and Motor Skills 17 (1):159-167.
Nielsen, J. 1993. Usability Engineering. Cambridge, MA: Academic Press.
Provost, Foster, and Tom Fawcett. 2001. "Robust classification for imprecise environments." Machine Learning Journal 42 (3):203-231.
Quimby, P. W. 2013. "A Spatial Display for Ground-Penetrating Radar Change Detection." Master's thesis, Aeronautics and Astronautics, Massachusetts Institute of Technology.
Roscoe, S. N. 1968. "Airborne displays for flight and navigation." Human Factors 10:321-332.
Staggers, N., and A. F. Norcio. 1993. "Mental models: concepts for human-computer interaction research." International Journal of Man-Machine Studies 38:587-605.
Stanley, Byron. 2016. "Localizing Ground Penetrating Radar." Tech Notes, June.
Surowiecki, James. 2004. The Wisdom of Crowds. New York: Anchor Books.
Wickens, Chris D., and Justin G. Hollands. 2000. Engineering Psychology and Human Performance. 3rd ed. Upper Saddle River, NJ: Prentice Hall.
Wolf, M., J. Krause, P. A. Carney, A. Bogart, and R. H. J. M. Kurvers. 2015. "Collective Intelligence Meets Medical Decision-Making: The Collective Outperforms the Best Radiologist." PLoS ONE 10 (8): e0134269.
Xing, J. 2006. Color and Visual Factors in ATC Displays. Washington, DC: FAA Civil Aerospace Medical Institute.