in press at behavior research methods doi : 10.3758/s13428-017-0886-6 · 2017. 3. 6. · doi :...
Running head: CHANGE DETECTION RELIABILITY
IN PRESS AT BEHAVIOR RESEARCH METHODS
DOI: 10.3758/s13428-017-0886-6
The reliability and stability of visual working memory capacity
Xu, Z.1*, Adam, K. C. S.2*, Fang, X.1, & Vogel, E. K.2
1 School of Psychology, Southwest University, Chongqing, China
2 Department of Psychology, University of Chicago, Chicago, IL
* These authors contributed equally to the work.
Word Count: 7151
Figures: 6
Tables: 4
Keywords: visual working memory, reliability, change detection
Contributions: Z.X. and E.V. designed the experiments; Z.X. and X.F. collected data. K.A. performed analyses and drafted the manuscript. K.A., Z.X., and E.V. revised the manuscript.
Acknowledgements: Research was supported by the Project of Humanities and Social Sciences, Ministry of Education, China (15YJA190008), the Fundamental Research Funds for the Central Universities (SWU1309117), NIH grant 2R01 MH087214-06A1 and Office of Naval Research grant N00014-12-1-0972. Datasets for all experiments are available online on Open Science Framework at https://osf.io/g7txf/.
Conflicts of Interest: none
Correspondence to: Kirsten C. S. Adam, University of Chicago, 940 E 57th St, Chicago, IL 60637, +1 (773) [email protected]
Abstract
Because of the central role of working memory capacity in cognition, many studies have used short measures of working memory capacity to examine its relationship to other domains. Here, we measured the reliability and stability of visual working memory capacity, measured using a single-probe change detection task. In Experiment 1, subjects (N = 135) completed a large number of trials of a change detection task (540 in total, 180 each of set-sizes 4, 6, and 8). With large numbers of trials and subjects, reliability estimates were high (α > .9). We then used an iterative downsampling procedure to create a look-up table for expected reliability in experiments with small sample sizes. In Experiment 2, subjects (N = 79) completed 31 sessions of single-probe change detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later. This unprecedented number of sessions allowed us to examine the effects of practice on stability and internal reliability. Even after much practice, individual differences were stable over time (average between-session r = .76).
Working Memory Capacity (WMC) is a core cognitive ability that predicts performance across many domains. For example, WMC predicts attentional control, fluid intelligence and real-world outcomes such as perceiving hazards while driving (Engle, Tuholski, Laughlin, & Conway, 1999; Fukuda, Vogel, Mayr, & Awh, 2010; Wood, Hartley, Furley, & Wilson, 2016). As such, researchers are often interested in devising brief measures of WMC to investigate the relationship of WMC to other cognitive processes. However, truncated versions of WMC tasks could potentially be inadequate for reliably measuring an individual’s capacity. Inadequate measurement could obscure correlations between measures or even differences in performance between experimental conditions. Furthermore, while WMC is considered to be a stable trait of the observer, little work has directly examined the role of extensive practice on the measurement of WMC over time. This is of particular concern due to the popularity of research examining whether training affects WMC (Melby-Lervåg & Hulme, 2013; Shipstead, Redick, & Engle, 2012). Extensive practice on any given cognitive task has the potential to significantly alter the nature of the variance that is determining performance. For example, extensive practice has the potential to induce a restriction of range problem, in which the bulk of the observers reach similar performance levels, thus reducing any opportunity to observe correlations with other measures. Consequently, a systematic study of the reliability and stability of WMC measures is critical for improving the measurement and reproducibility of major phenomena in this field.
In the present study, we seek to establish the reliability and stability of one particular WMC measure: Change Detection. Change detection measures of visual working memory have gained popularity as a means of assessing individual differences in capacity. In a typical change detection task, participants briefly view an array of simple visual items (~100 to 500 ms), such as colored squares, and remember these items across a short delay (~1 to 2 seconds). At test, observers are presented with an item at one of the remembered locations, and they indicate whether the presented test item is the same as the remembered item (“no-change” trial) or is different (“change” trial). Performance can be quantified as raw accuracy or converted into a capacity estimate (“K”). In capacity estimates, performance for change trials and no-change trials is calculated separately as hits (proportion of correct change trials) and false alarms (proportion of incorrect no-change trials) and converted into a set-size dependent score (Cowan, 2001; Pashler, 1988; Rouder, Morey, Morey, & Cowan, 2011).
There are several beneficial features of change detection tasks that have led to their increased popularity. First, change detection memory tasks are simple and short enough to be used with developmental and clinical populations (e.g. Cowan, Fristoe, Elliott, Brunner, & Saults, 2006; Gold, Wilk, McMahon, Buchanan, & Luck, 2003; Lee et al., 2010). Second, the relatively short length of trials lends the task well to neural measures that require large numbers of trials. In particular, neural studies employing change detection tasks have provided strong corroborating evidence of capacity limits in WM (Todd & Marois, 2004; Vogel & Machizawa, 2004), and have yielded insights into potential mechanisms underlying individual differences in working memory capacity (for review, see: Luria, Balaban, Awh, & Vogel, 2016). Finally, change detection tasks and closely-related memory-guided saccade tasks can be used with animal models from pigeons (Gibson, Wasserman, & Luck, 2011) to non-human primates (Buschman, Siegel, Roy, & Miller, 2011), providing a rare opportunity to directly compare behavior and neural correlates of task performance across species (Elmore, Magnotti, Katz, & Wright, 2012; Reinhart et al., 2012).
A main aim of this study is to quantify the effect of measurement error and sample size on the reliability of change detection estimates. In previous studies, change detection estimates of capacity have yielded good reliability estimates (e.g. Pailian & Halberda, 2015; Unsworth et al., 2014). However, measurement error can vary dramatically with the number of trials in a task, thus impacting reliability; Pailian and Halberda (2015) found that reliability of change detection estimates greatly improved when the number of trials was increased. Researchers frequently employ vastly different numbers of trials and subjects in studies of individual differences, but the effect of trial number on change-detection reliability has never been fully characterized. In studies using large batteries of tasks, time and measurement error are forces working in opposition to one another. When researchers want to minimize the amount of time that a task takes, measures are often truncated to expedite administration. Such truncated measures increase measurement noise and potentially harm the reliability of the measure. At present, there is no clear understanding of the minimum numbers of subjects and trials that are necessary to obtain reliable estimates of change detection capacity.
In addition to measurement error within-session, reliability of individual differences could be compromised with extensive practice. Previously, it was found that visual working memory capacity estimates were stable (r = .77) after 1.5 years between testing sessions (Johnson et al., 2013). However, the effect of extensive practice on change detection estimates of capacity has yet to be characterized. Extensive practice could harm the reliability and stability of measures in a couple of ways. First, it is possible that participants could improve so much that they reach performance ceiling, thus eliminating variability between individuals. Second, if individual differences are due to the utilization of optimal versus sub-optimal strategies, then participants might converge to a common mean after engaging in extensive practice and finding optimal task strategies. Both of these hypothetical possibilities would call into question the true stability of working memory capacity estimates, and likewise severely harm the statistical reliability of the measure. As such, in Experiment 2 we directly quantified the effect of extensive practice on the stability of working memory capacity estimates.
Overview of Experiments
We measured the reliability and stability of a single-probe change-detection measure of visual working memory capacity. In Experiment 1, we measured the reliability of capacity estimates obtained with a commonly used version of the color change-detection task for a relatively large number of subjects (n = 137) and a larger than typical number of trials (t = 540). In Experiment 2, we measured the stability of capacity estimates across an unprecedented number of testing sessions (31). Because of the large number of sessions, we could investigate the stability of change detection estimates after extended practice and over a period of 60 days.
Experiment 1
Materials and Methods
Participants
A total of 137 individuals (35 males; mean age = 19.97, SD = 1.07) with normal or corrected-to-normal vision participated in the experiment. Participants provided written informed consent, and the study was approved by the Ethics Committee at Southwest University. Participants received monetary compensation for their participation. Two participants were excluded because they had negative average capacity values, resulting in a final sample of 135 subjects.
Stimuli
Stimuli were presented on monitors with a refresh rate of 75 Hz and a screen resolution of 1024 x 768. Participants sat approximately 60 cm from the screen, though a chin rest was not used so all visual angle estimates are approximate. In addition, there were some small variations in monitor size (five 16” CRT monitors, three 19” LCD monitors) in testing rooms, leading to small variations in the size of the colored squares from monitor to monitor. Details are provided about the approximate range in degrees of visual angle.
All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using Psychophysics toolbox. Colored squares (51 pixels; range of 1.55° to 2.0° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 10.3° to 13.35° horizontally and 7.9° to 9.8° vertically. Squares could appear in any of nine distinct colors, and colors were sampled without replacement within each trial (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Participants were instructed to fixate a small black dot (approximate range: .36° to .47° visual angle) at the center of the display.
Procedures
Each trial began with a blank fixation period of 1,000 ms. Then, participants briefly viewed an array of 4, 6, or 8 colored squares (150 ms) which they remembered across a blank delay period (1,000 ms). At test, one colored square was presented at one of the remembered locations. There was an equal probability that the probed square was the same color (no-change trial) or a different color (change trial). Participants made an unspeeded response by pressing the “z” key if the color was the same and the “/” key if the color was different. Participants completed 180 trials each of set-sizes 4, 6, and 8 (540 trials total). Trials were divided into 9 blocks, and participants were given a brief rest period (30 seconds) after each block. To calculate capacity, change detection accuracy was transformed into a K estimate using Cowan’s (2001) formula K = N × (H − FA), where N represents the set-size, H is the hit rate (proportion of correct change trials), and FA is the false alarm rate (proportion of incorrect no-change trials). Cowan’s formula is best for single-probe displays like the one employed here. For change detection tasks using whole-display probes, Pashler’s (1988) formula may be more appropriate (Rouder et al., 2011).
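As an illustration of the capacity calculation, the following minimal Python sketch applies Cowan's formula K = N × (H − FA) to one set-size. The trial counts and the function names are hypothetical and chosen only for this example; the original analyses were run in MATLAB.

```python
def hit_and_false_alarm_rates(trials):
    """Return (H, FA) from a list of (is_change_trial, responded_change) pairs."""
    change_responses = [resp for is_change, resp in trials if is_change]
    same_responses = [resp for is_change, resp in trials if not is_change]
    hit_rate = sum(change_responses) / len(change_responses)
    false_alarm_rate = sum(same_responses) / len(same_responses)
    return hit_rate, false_alarm_rate


def cowan_k(set_size, hit_rate, false_alarm_rate):
    """Cowan's (2001) single-probe capacity estimate: K = N * (H - FA)."""
    return set_size * (hit_rate - false_alarm_rate)


# Hypothetical subject: 180 set-size-6 trials, half change and half no-change.
trials = (
    [(True, True)] * 70      # hits
    + [(True, False)] * 20   # misses
    + [(False, True)] * 18   # false alarms
    + [(False, False)] * 72  # correct rejections
)
hit_rate, false_alarm_rate = hit_and_false_alarm_rates(trials)
k_estimate = cowan_k(6, hit_rate, false_alarm_rate)
print(f"H = {hit_rate:.3f}, FA = {false_alarm_rate:.3f}, K = {k_estimate:.2f}")
```

Because K scales the difference H − FA, a subject whose false alarm rate exceeds their hit rate receives a negative K, which is the situation that led to the two exclusions described above.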
Results
Descriptive statistics for each set-size condition are shown in Table 1, and data for both Experiment 1 and 2 are available online on Open Science Framework at https://osf.io/g7txf/. There was a significant difference in performance across set-sizes, F(2, 268) = 20.6, p < .001, ηp² = .133, and polynomial contrasts revealed a significant linear trend, F(1, 134) = 36.48, p < .001, ηp² = .214, indicating that average performance declined slightly with increased memory load.
             Mean K   SD     Min    Max    Kurtosis   Skewness
Set-Size 4   2.32     .70    .58    3.87   -.49       -.34
Set-Size 6   2.10     .97    .07    4.80   -.18        .34
Set-Size 8   1.98     .97   -.18    4.53   -.52       -.14
Average      2.14     .82    .38    4.31   -.47        .07
Table 1. Descriptive statistics for Experiment 1. Descriptive statistics are shown separately for each set-size and for the average of the three set-sizes. Kurtosis and skewness values are both centered around 0. Neither kurtosis nor skewness was credibly non-normal in any condition (Cramer, 1997).
Reliability of the Full Sample: Cronbach’s Alpha
We computed Cronbach’s alpha (unstandardized) using K scores from the three set-sizes as items (180 trials contributing to each item), and obtained a value of α = .91 (Cronbach, 1951). We also computed Cronbach’s alpha using K scores from the nine blocks of trials (60 trials contributing to each item) and obtained a nearly identical value of α = .92. Finally, we computed Cronbach’s alpha using raw accuracy for single trials (540 items), and obtained an identical value of α = .92. Thus, change detection estimates had high internal reliability for this large sample of subjects, and the precise method used to divide trials into “items” does not impact Cronbach’s alpha estimates of reliability for the full sample. Further, using raw accuracy versus bias-corrected K scores did not impact reliability.
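For readers who want to reproduce this kind of estimate, here is a minimal Python sketch of unstandardized Cronbach's alpha, computed as k/(k − 1) × (1 − Σ item variances / variance of the summed score). The five subjects and their K scores are invented for illustration, and the function name is an assumption of this sketch rather than anything from the original analysis code.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Unstandardized Cronbach's alpha.

    scores: one row per subject, one column per item
    (e.g. a K score for each of the three set-sizes).
    """
    n_items = len(scores[0])
    # Sample variance of each item across subjects.
    item_variances = [
        variance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Sample variance of each subject's summed score.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical K scores for five subjects at set-sizes 4, 6, and 8.
k_scores = [
    [2.0, 1.8, 1.5],
    [3.1, 3.0, 2.6],
    [2.5, 2.2, 2.4],
    [1.2, 0.9, 1.0],
    [3.6, 3.9, 3.3],
]
alpha = cronbach_alpha(k_scores)
print(f"Cronbach's alpha = {alpha:.3f}")
```

The same function applies unchanged whether the "items" are three set-size scores, nine block scores, or 540 single-trial accuracies; only the number of columns changes.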
Reliability of the Full Sample: Split-half
The split-half correlation of the K scores for even and odd trials was reliable, r = .88, p < .001, 95% CI [.78, .88]. Correcting for attenuation yielded a split-half correlation value of r = .94 (Brown, 1910; Spearman, 1910). Likewise, the capacity scores from individual set-sizes correlated with each other: r(ss4-ss6) = .84, p < .001, 95% CI [.78, .88]; r(ss6-ss8) = .78, p < .001, 95% CI [.72, .85]; r(ss4-ss8) = .76, p < .001, 95% CI [.68, .83]. Split-half correlations for individual set-sizes yielded Spearman-Brown corrected correlation values of r = .91 for set-size 4, r = .86 for set-size 6, and r = .76 for set-size 8, respectively.
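The correction applied to the split-half correlation is the standard Spearman-Brown step-up for doubling test length, r_corrected = 2r / (1 + r). A minimal Python sketch follows; the six pairs of even-trial and odd-trial K scores are made up for illustration, and the helper names are assumptions of this sketch.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test reliability."""
    return 2 * r_half / (1 + r_half)


# Hypothetical K scores computed separately from even and odd trials.
even_k = [2.1, 3.0, 1.4, 2.6, 3.5, 1.9]
odd_k = [2.4, 2.8, 1.1, 2.9, 3.2, 2.2]
r_half = pearson_r(even_k, odd_k)
print(f"split-half r = {r_half:.2f}, corrected = {spearman_brown(r_half):.2f}")

# Applying the correction to the uncorrected value reported above (r = .88).
print(f"{spearman_brown(0.88):.2f}")
```

Applied to the reported uncorrected correlation of .88, the correction gives approximately .94, in agreement with the corrected value in the text.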
The drop in capacity from set-size 4 to set-size 8 has been used in the literature as a measure of filtering ability. However, the internal reliability of this difference score has typically been low (Pailian & Halberda, 2015; Unsworth et al., 2014). Likewise, we found here that the split-half reliability of the performance decline from set-size 4 to set-size 8 (“4-8 Drop”) was low, with a Spearman-Brown corrected correlation value of r = .24. While weak, this correlation is the same strength as reported in earlier work (Unsworth et al., 2014). The split-half reliability of the performance decline from set-size 4 to set-size 6 was slightly higher, r = .39, and the split-half reliability of the difference between set-size 6 and set-size 8 performance was very low, r = .08. The reliability of difference scores can be impacted both by (1) the internal reliability of each measure used to compute the difference and (2) the degree of correlation between the two measures (Rodebaugh et al., 2016). Although the internal reliability of each individual set-size was high, the positive correlation between set-sizes may have decreased the reliability of the set-size difference scores.
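To make the two influences on difference-score reliability concrete, the sketch below uses the classical test theory expression for the reliability of a difference X − Y. This formula is a textbook result and is not taken from the manuscript; the function name is hypothetical, and plugging in the summary statistics reported above is only a rough plausibility check, not a reanalysis.

```python
def difference_score_reliability(sd_x, sd_y, rel_x, rel_y, r_xy):
    """Classical test theory reliability of the difference X - Y.

    sd_x, sd_y   : standard deviations of the two measures
    rel_x, rel_y : reliabilities of the two measures
    r_xy         : correlation between the two measures
    """
    shared = 2 * r_xy * sd_x * sd_y
    true_variance = sd_x ** 2 * rel_x + sd_y ** 2 * rel_y - shared
    observed_variance = sd_x ** 2 + sd_y ** 2 - shared
    return true_variance / observed_variance


# Two equally reliable measures (.90 each): reliability of the difference
# falls as the correlation between the measures rises.
for r_xy in (0.2, 0.5, 0.8):
    value = difference_score_reliability(1.0, 1.0, 0.9, 0.9, r_xy)
    print(f"r_xy = {r_xy:.1f} -> difference reliability = {value:.2f}")

# Rough check with the reported set-size 4 and set-size 8 statistics
# (SD = .70 and .97, reliabilities .91 and .76, correlation .76).
approx = difference_score_reliability(0.70, 0.97, 0.91, 0.76, 0.76)
print(f"{approx:.2f}")
```

With the reported values, the expression gives roughly .32, which is in the same low range as the observed 4-8 Drop reliability of .24 and shows how a strong positive correlation between set-sizes can erode the reliability of their difference even when each set-size is measured well.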
An Iterative Downsampling Approach
To investigate the effects of sample size and trial number on reliability estimates, we used an iterative downsampling procedure. Two reliability metrics were assessed: (1) Cronbach’s alpha, using single trial accuracy as items and (2) split-half correlations using all trials. For the downsampling procedure, we randomly sampled subjects and trials from the full dataset. Number of subjects (n) was varied from 5 to 135 in steps of 5. The number of trials (t) was varied from 5 to 540 in steps of 5. Number of subjects and number of trials were factorially combined (2916 cells total). For each cell in the design, we ran 100 sampling iterations. On each iteration, n subjects and t trials were randomly sampled from the full dataset and reliability metrics were calculated for the sample.
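The downsampling logic for a single cell of the design can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: the data are simulated (each simulated subject has a fixed probability of a correct response), subjects and trials are drawn without replacement, and reliability is computed on raw accuracy with an even/odd split followed by the Spearman-Brown correction. All function names are hypothetical.

```python
import random


def pearson_r(x, y):
    """Pearson correlation; returns NaN if either list has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return float("nan")
    return cov / (ss_x * ss_y) ** 0.5


def split_half_reliability(accuracy):
    """Spearman-Brown corrected even/odd split-half reliability.

    accuracy: one row per subject, one 0/1 entry per trial.
    """
    even = [sum(row[0::2]) / len(row[0::2]) for row in accuracy]
    odd = [sum(row[1::2]) / len(row[1::2]) for row in accuracy]
    r = pearson_r(even, odd)
    if r <= -1:  # correction is undefined at r = -1
        return float("nan")
    return 2 * r / (1 + r)


def downsample(accuracy, n_subjects, n_trials, n_iterations, rng):
    """Reliability estimates from repeated random subsamples of one cell."""
    estimates = []
    for _ in range(n_iterations):
        subjects = rng.sample(range(len(accuracy)), n_subjects)
        trials = rng.sample(range(len(accuracy[0])), n_trials)
        sample = [[accuracy[s][t] for t in trials] for s in subjects]
        estimates.append(split_half_reliability(sample))
    return estimates


def simulate_accuracy(n_subjects, n_trials, rng):
    """Simulated stand-in for the real dataset (assumption of this sketch)."""
    data = []
    for _ in range(n_subjects):
        p_correct = rng.uniform(0.55, 0.95)
        data.append([1 if rng.random() < p_correct else 0 for _ in range(n_trials)])
    return data


rng = random.Random(0)
full_data = simulate_accuracy(135, 540, rng)
estimates = downsample(full_data, n_subjects=30, n_trials=120, n_iterations=100, rng=rng)
valid = [r for r in estimates if r == r]  # drop NaN values, if any
print(f"full sample: {split_half_reliability(full_data):.2f}")
print(f"30 subjects x 120 trials: mean = {sum(valid) / len(valid):.2f}, worst = {min(valid):.2f}")
```

Looping `downsample` over n = 5, 10, ..., 135 and t = 5, 10, ..., 540 would fill in the full 27 × 108 grid of cells; storing both the mean and the minimum per cell corresponds to the "average" and "worst case" summaries described in the text.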
Figure 1 shows the results of the downsampling procedure for Cronbach’s alpha. Figure 2 shows the results of the downsampling procedure for split-half reliability estimates. In each plot, we show both the average reliability obtained across the 100 iterations (Fig. 1A and Fig. 2A) and the worst reliability obtained across the 100 iterations (Fig. 1B and Fig. 2B). Conceptually, we could think of each iteration of the downsampling procedure as akin to running one “experiment” with subjects randomly sampled from our “population” of 137. While it is good to know the average expected reliability across many experiments, the typical experimenter will only run an experiment once. Thus, considering the “worst case scenario” is instructive for planning the number of subjects and the number of trials to be collected. For a more complete picture of the breadth of reliabilities obtained, we can also consider the variability in reliability across iterations (SD) and the range of reliability values (Fig. 2C-2D). Finally, we repeated this iterative downsampling approach for each individual set-size. Average reliability as well as the variability of reliability for individual set-sizes are shown in Figure 3. Note, each set-size begins with 1/3 as many trials as Figures 1 and 2.
Next, we looked at some potential characteristics of samples with low reliability (e.g. iterations with particularly low versus high reliability). We ran 500 sampling iterations of 30 subjects and 120 trials, then we did a median split for high- versus low-reliability samples. There was no significant difference in the mean (p = .86), skewness (p = .60) or kurtosis (p = .70) of high versus low reliability samples. There was, however, a significant effect of sample range and variability. As would be expected, samples with higher reliability had a larger standard deviation, t(498) = 26.7, p < .001, 95% CI [.14, .17], and a wider range, t(498) = 15.2, p < .001, 95% CI [.52, .67], than samples with low reliability.
Figure 1. Cronbach’s alpha as a function of the number of trials and the number of subjects in Experiment 1. In each cell, Cronbach’s alpha was computed for t trials (x-axis) and n subjects (y-axis). (a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials).
Figure 2. Spearman-Brown corrected split-half reliability estimates as a function of the number of trials and subjects in Experiment 1. (a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials). (c) Standard deviation of the reliability obtained across samples. (d) Range of reliability values obtained across samples.
Figure 3. Spearman-Brown corrected split-half reliability estimates for each set-size in Experiment 1. Top 3 panels: Average reliability for each set-size. Bottom 3 panels: Standard Deviation of the reliability for each set-size across 100 downsampling iterations.
A Note for Fixed Capacity + Attention Estimates of Capacity
So far, we have discussed only the most commonly used methods of estimating working memory capacity (K scores and percent correct). Other methods of estimating capacity have been used, and we would like to briefly mention one of them. Rouder and colleagues (2008) suggested adding an attentional lapse parameter to estimates of visual working memory capacity, a model referred to as Fixed Capacity + Attention. Adding an attentional lapse parameter accounts for trials where subjects are inattentive to the task at hand. Specifically, participants commonly make errors on trials that should be well within capacity limits (e.g. set-size 1), and adding a lapse parameter can help to explain these anomalous dips in performance. Unlike typical estimates of capacity in which a K value is computed directly from performance at each set-size and then averaged, this model uses a log-likelihood estimation technique that estimates a single capacity parameter by simultaneously considering performance across all set-sizes and/or change probability conditions. Critically, this model assumes that data is obtained for at least one sub-capacity set-size, and that any error made on this set-size reflects an attentional lapse. If the model is fit to data that lacks at least one sub-capacity set-size (e.g. 1 or 2 items), then the model will fit poorly and provide nonsensical parameter estimates.
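One way to see why a sub-capacity set-size matters is to write out the model's predicted hit and false alarm rates. The sketch below assumes a commonly cited form of the fixed capacity plus lapse equations for single-probe displays (attend with probability a; when attending, the probed item is in memory with probability d = min(1, K/N); otherwise guess "change" with probability g). These equations, the parameter values, and the function name are assumptions of this sketch and are not taken from the manuscript or from the Rouder code it cites.

```python
def predicted_rates(capacity, attention, guess, set_size):
    """Predicted (hit, false alarm) rates under an assumed lapse model.

    capacity  : K, number of items held in memory
    attention : a, probability of attending on a trial (1 - lapse rate)
    guess     : g, probability of guessing "change"
    """
    d = min(1.0, capacity / set_size)
    hit = attention * (d + (1 - d) * guess) + (1 - attention) * guess
    false_alarm = attention * (1 - d) * guess + (1 - attention) * guess
    return hit, false_alarm


# Two hypothetical observers with the same product a * K = 3.
observer_1 = {"capacity": 3.0, "attention": 1.00, "guess": 0.5}
observer_2 = {"capacity": 4.0, "attention": 0.75, "guess": 0.5}

for set_size in (2, 4, 6, 8):
    h1, f1 = predicted_rates(set_size=set_size, **observer_1)
    h2, f2 = predicted_rates(set_size=set_size, **observer_2)
    print(f"N = {set_size}: observer 1 H = {h1:.3f}, FA = {f1:.3f}; "
          f"observer 2 H = {h2:.3f}, FA = {f2:.3f}")
```

Under these assumed equations, both rates depend on capacity and attention only through the product a × d. When every set-size is at or above capacity, d = K/N, so only a × K is constrained by the data and the two observers above make identical predictions at set-sizes 4, 6, and 8. They come apart only at set-size 2, where d = 1 and the lapse rate becomes separately identifiable.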
Recently, Van Snellenberg and colleagues used the Fixed Capacity + Attention Model to calculate capacity for a change detection task, and they found that the reliability of the model’s capacity parameter was low (r = .35), and did not correlate with other working memory tasks (Van Snellenberg, Conway, Spicer, Read, & Smith, 2014). Critically, however, this study used only relatively high set-sizes (4 and 8), and lacked a sub-capacity set-size, so model fits were likely poor. Using code made available from Rouder et al., we fit a Fixed Capacity + Attention model to our data (Rouder, n.d.). We found that when this model is misapplied (i.e. used on data without at least 1 sub-capacity set-size) the internal reliability of the capacity parameter was low (uncorrected r = .35), and the parameter was negatively correlated with raw change detection accuracy, r = -.25, p = .004. If we had only applied this model to our data, we would have mistakenly concluded that change detection measures offer poor reliability and do not correlate with other measures of working memory capacity.
Discussion
Here, we have shown that when sufficient numbers of trials and subjects are collected, the reliability of change detection capacity is remarkably high (r > .9). On the other hand, a systematic downsampling method revealed that insufficient trials or insufficient subject numbers could dramatically reduce the reliability obtained in a single experiment. If researchers hope to measure the correlation between visual working memory capacity and some other measure, Figures 1 and 2 can serve as an approximate guide to expected reliability. Because we only had a single sample of the largest n (137), we cannot make definitive claims about the reliability of future samples of this size. However, given the stabilization of correlation coefficients with large sample sizes and the extremely high correlation coefficient obtained, we can be relatively confident that the reliability estimate for our full sample (n = 137) would not change substantially in future samples of university students. Further, we can make claims about how the reliability of small, well-defined sub-samples of this “population” can systematically deviate from an empirical upper bound.
The average capacity obtained for this sample was slightly lower than some other values in the literature, typically cited as around 3-4 items. The slightly lower average for this sample could potentially cause some concern about the generalizability of these reliability values for future samples. For the current manuscript’s sample, average K-scores for set-sizes 4 and 8 were K = 2.3 and K = 2.0, respectively. The largest, most comparable sample to the present sample is a 495 subject sample in work by Fukuda, Woodman, and Vogel (2015). In that sample, the average K-scores for set-sizes 4 and 8 were K = 2.7 and K = 2.4, respectively, and the task design was nearly identical (150 ms encoding time, 1000 ms retention interval, no color repetitions allowed, and set-sizes 4 and 8). The difference of 0.3 to 0.4 items between these two samples is relatively small, though likely significant. However, for the purposes of estimating reliability, the variance of the distribution is more important than the mean. The variability observed in the present sample (SD = 0.7 for set-size 4, SD = .97 for set-size 8) was very similar to that observed in the Fukuda et al. sample (SD = 0.6 for set-size 4 and SD = 1.2 for set-size 8), though unfortunately the Fukuda et al. study did not report reliability. Because of the nearly identical variability of scores across these two samples, we can infer that our reliability results would indeed generalize to other large samples for which change detection scores have been obtained.
We recommend applying an iterative downsampling approach to other measures where expediency of task administration is valued, but reliability is paramount. The stats-savvy reader may note that the Spearman-Brown prophecy formula also allows one to calculate how many observations must be added to improve expected reliability, according to the formula:
$$N = \frac{\rho^{*}_{xx'}\,(1 - \rho_{xx'})}{\rho_{xx'}\,(1 - \rho^{*}_{xx'})}$$
where $\rho^{*}_{xx'}$ is the desired correlation strength, $\rho_{xx'}$ is the observed correlation and N is the number of times that test length must be multiplied to achieve the desired correlation strength. Critically, however, this formula does not account for the accuracy of the observed correlation. Thus, if one starts from an unreliable correlation coefficient obtained with a small number of subjects and trials, one will obtain an unreliable estimate of the number of observations needed to improve correlation strength. In experiments such as this one, both number of trials and number of subjects will drastically change estimates of the number of subjects needed to observe correlations of a desired strength.
Let’s take an example from our iterative downsampling procedure. Imagine that we ran 100 experiments, each with 15 subjects and 150 total trials of change detection. Doing so, we would obtain 100 different estimates of the strength of the true split-half correlation. We could then apply the Spearman-Brown formula to each of these 100 estimates in order to calculate the number of trials needed to obtain a desired reliability of r = .8. So doing, we would find that, on average, we would need around 140 trials to obtain the desired reliability. However, because of the large variability in the observed correlation strength (r = .37 to .97), if we had only run the “best case” experiment (r = .97), we would estimate that we need only 18 trials to obtain our desired reliability of r = .8 with 15 subjects. On the other hand, if we had run the “worst case” experiment (r = .37), then we would estimate that we need 1,030 trials. There are downsides to both types of estimation errors. While a pessimistic estimate of the number of trials needed (>1000) would certainly ensure adequate reliability, this may come at the cost of time and participants’ frustration. Conversely, an overly optimistic estimate of the number of trials needed (<20) would lead to underpowered studies that waste time and funds.
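The arithmetic behind this worked example can be checked with a short Python sketch of the Spearman-Brown prophecy formula, N = ρ*(1 − ρ) / (ρ(1 − ρ*)), where ρ is the observed reliability and ρ* the desired one. The function name is hypothetical, and the inputs are the rounded correlations quoted in the text, so the results are expected to match the quoted trial counts only approximately.

```python
def length_multiplier(observed_r, desired_r):
    """Spearman-Brown prophecy: factor by which test length must change
    to move from the observed reliability to the desired reliability."""
    return desired_r * (1 - observed_r) / (observed_r * (1 - desired_r))


current_trials = 150
desired_r = 0.8

# "Best case" and "worst case" observed reliabilities from the example.
for label, observed_r in (("best case", 0.97), ("worst case", 0.37)):
    multiplier = length_multiplier(observed_r, desired_r)
    trials_needed = current_trials * multiplier
    print(f"{label}: observed r = {observed_r:.2f}, "
          f"multiplier = {multiplier:.3f}, trials needed = {trials_needed:.0f}")
```

With these rounded inputs, the best case comes out between 18 and 19 trials and the worst case at a little over 1,000 trials, in line with the figures quoted above; the small discrepancy for the worst case presumably reflects rounding of the observed correlation.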
Finally, we investigated an alternative parameterization of capacity based on a model that assumes a fixed capacity and an attention lapse parameter (Rouder et al., 2008). Critically, this model attempts to explain errors for set-sizes that are well within capacity limits (e.g. 1 item). If researchers inappropriately apply this model to change detection data with only large set-sizes, they would erroneously conclude that change detection tasks yield poor reliability and fail to correlate with other estimates of capacity (e.g. Van Snellenberg et al., 2014).
In Experiment 2, we shifted our focus to the stability of change detection estimates. That is, how consistent are estimates of capacity from day-to-day? We collected an unprecedented number of sessions of change detection performance (31) spanning 60 days. We examined the stability of capacity estimates, defined as the correlation between individuals’ capacity estimates from one day to the next. Since capacity is thought to be a stable trait of the individual, we predicted that individual differences in capacity should be reliable across many testing sessions.
Experiment 2
Materials and Methods
Participants
79 individuals (male: 22; female: 57; mean age = 22.67 years, SD = 2.31) with normal or corrected-to-normal vision participated for monetary compensation. The study was approved by the Ethics Committee of Southwest University.
Stimuli
Some experimental sessions were completed in the lab and others were completed in participants’ homes. In the lab, stimuli were presented on monitors with a refresh rate of 75 Hz. At home, stimuli were presented on laptop screens with somewhat variable refresh rates and sizes. In both cases, participants sat approximately 60 cm from the screen, though a chin rest was not used so all visual angle estimates are approximate. In the lab, there were some small variations in monitor size (five 18.5” LCD monitors, one 19” LCD monitor) in testing rooms, leading to small variations in the size of the colored squares. Details are provided about the approximate range in degrees of visual angle in the lab.
All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using Psychophysics toolbox. Colored squares (51 pixels; range of 1.28° to 1.46° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 14.4° to 14.8° horizontally and 8.1° to 8.4° vertically. Squares could appear in any of nine distinct colors (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Colors were sampled without replacement for set-size 4 and set-size 6 trials. Each color could be repeated up to 1 time in set-size 8 trials (i.e. colors were sampled from a list of 18 colors, with each of the 9 unique colors appearing twice). Participants were instructed to fixate a small black dot (~.3° visual angle) at the center of the display.
Procedures
Trial procedures for the change detection task were identical to Experiment 1. Participants completed a total of 31 sessions of the change detection task. In each session, participants completed a total of 120 trials (split over 5 blocks). There were 40 trials each of set-sizes 4, 6, and 8. Participants were asked to finish the change detection task once a day for 30 consecutive days. They could do this task on their own computers or on the experimenters’ computers throughout the day. Participants were instructed that they should complete the task in a relatively quiet environment and not do anything else (e.g. talking to others) at the same time. Experimenters reminded the participants to finish the task and collected the data files every day.
Results
Descriptive Statistics
Descriptive statistics for average K values across the 31 sessions are shown in Table 2. Across all sessions, the average capacity was 2.83 (SD = .23). Change in mean capacity over time is shown in Figure 4A. A repeated measures ANOVA revealed a significant difference in capacity across sessions, F(18.76, 1388.38) = 15.04, p < .001, ηp² = .169 (Greenhouse-Geisser values are reported because Mauchly’s Test of Sphericity was violated). Subjects’ performance initially improved across sessions, then leveled off. The group-average increase in capacity over time is well-described by a two-term exponential model (SSE = .08, RMSE = .06, Adjusted R² = .94), described by the equation: 𝑦 = 2.776×𝑒.556- − .798×𝑒).:;- . To test the impression that individuals’ improvement slowed over time, we fit several growth curve models to the data using Maximum Likelihood Estimation (‘fitmle.m’) with Subject entered as a random factor. We coded time as days from the first session (Session 1 = 0). Model A included only a random intercept; Model B included a random intercept and a random linear effect of time; Model C added in a quadratic effect of time, and Model D added a cubic effect of time. As shown in Table 3, the quadratic model provided the best fit to the data. Further testing revealed that both random slopes and intercepts were needed to best fit the data (Table 4, Models C1-C4). That is, participants started out with different baseline capacity values, and they improved at different rates. However, the covariance matrix for Model C revealed that there was no systematic relationship between initial capacity (intercept) and either the linear effect of time, r = .21, 95% CI [-.10, .49], or the quadratic effect of time, r = -.14, 95% CI [-.48, .24]. This suggests that there was no meaningful relationship between a participant’s initial capacity and their rate of improvement. To visualize this point, we did a quartile split of session 1 performance, and then plotted the change over time for each group (Figure 4).
Figure 4. Average capacity (K) across testing sessions. Shaded bars represent standard error of the mean. Note, the axis is spliced between days 30 and 60, as no intervening data points were collected during this time. Left: Average change in performance over time. Right: Average change in performance over time for each quartile of subjects (quartile split performed on data from session 1).
         N    Mean   SD     Minimum   Maximum   Kurtosis   Skewness
Day 1    79   2.15   0.85    0.40     4.03      -0.69       0.24
Day 2    79   2.36   0.86    0.07     3.97      -0.24      -0.32
Day 3    79   2.43   0.82    0.80     4.07      -0.62      -0.29
Day 4    78   2.51   0.85    0.40     4.10      -0.31      -0.31
Day 5    79   2.52   0.93    0.57     4.27      -0.55      -0.13
Day 6    79   2.74   0.92    0.53     4.60      -0.39      -0.20
Day 7    79   2.73   0.91    0.67     4.63      -0.88      -0.09
Day 8    79   2.66   0.87    1.03     4.70      -0.66       0.06
Day 9    79   2.81   0.92    0.50     5.07      -0.18      -0.19
Day 10   79   2.86   0.94    0.77     4.70      -0.84       0.01
Day 11   78   2.79   0.94    0.40     4.27      -0.51      -0.55*
Day 12   79   2.83   1.01   -0.10     4.80      -0.38      -0.37
Day 13   78   2.85   0.96    0.37     4.80      -0.57      -0.21
Day 14   79   3.01   0.95    0.93     5.03      -0.46      -0.11
Day 15   78   2.85   0.92    0.37     4.37       0.12      -0.73*
Day 16   79   2.91   0.92    0.23     4.90      -0.05      -0.35
Day 17   79   2.84   0.90    0.87     4.77      -0.51      -0.18
Day 18   79   2.93   1.02    0.53     4.73      -0.40      -0.23
Day 19   79   2.90   0.92    0.87     4.57      -0.69      -0.24
Day 20   79   2.94   0.92    0.47     4.93      -0.03      -0.32
Day 21   79   2.98   0.94    0.80     4.90      -0.08      -0.47
Day 22   79   2.99   0.98    0.83     4.90      -0.65      -0.23
Day 23   79   2.86   1.05    0.23     5.47      -0.17      -0.14
Day 24   78   3.00   0.98    0.97     4.77      -0.74      -0.26
Day 25   79   3.04   0.95    0.67     5.03      -0.41      -0.16
Day 26   79   3.01   0.93    0.43     5.07      -0.28      -0.34
Day 27   79   3.09   1.06    0.43     5.00      -0.51      -0.29
Day 28   79   3.04   0.97    0.33     4.83      -0.22      -0.48
Day 29   79   3.01   1.04    0.77     5.07      -0.38      -0.33
Day 30   79   3.02   1.05    0.33     5.00      -0.48      -0.29
Day 60   79   3.00   1.08   -0.13     5.40       0.29      -0.58*
421
Table2.DescriptivestatisticsforExperiment2.Descriptivestatisticsareshownseparately422
foreachset-sizeandfortheaverageofthethreeset-sizes.Kurtosisandskewnessvalues423
arebothcenteredaround0.Asterisksdenotecredibledeviationfromnormality(Cramer,424
1997).425

Table 3. Comparison of Linear, Quadratic, and Cubic growth models, all with random intercepts and slopes where applicable.

                  Model A:         Model B:    Model C:     Model D:
                  Intercept Only   Linear      Quadratic    Cubic
Intercept         2.83***          2.60***     2.41***      2.29***
Linear Slope                       0.014***    .037***      .07**
Quadratic Slope                                -.0005***    -.002*
Cubic Slope                                                 2 x 10^-5 n.s.
-2LL              4366.2           4084.8      3914.7       4231.6
BIC               4389.6           4131.6      3992.7       4348.6
*** p < .001, ** p < .01, * p < .05

Table 4. Comparison of fixed versus random slopes and intercept.

         Model C1:      Model C2:       Model C3:      Model C4:
         Fixed Int.     Fixed Int.      Random Int.    Random Int.
         Fixed Slope    Random Slope    Fixed Slope    Random Slope
-2LL     6672.3         4627.7          4009.1         3914.7
BIC      6703.5         4682.3          4048.1         3992.7
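
As a rough check on how the BIC rows in Tables 3 and 4 relate to the -2LL rows, the sketch below applies the standard definition BIC = -2LL + k ln(n). Two inputs are assumptions rather than values stated in the text: n is taken to be the number of subject-sessions implied by Table 2, and each parameter count k is taken to be the number of fixed effects, plus the unique variances and covariances of the random effects, plus one residual variance.

```python
import math

# Assumption: n = total subject-sessions from Table 2
# (79 subjects x 31 sessions, minus one missing subject on Days 4, 11, 13, 15, 24).
n_obs = 79 * 31 - 5

def bic(neg2ll, n_params, n):
    """Schwarz criterion: BIC = -2LL + k * ln(n)."""
    return neg2ll + n_params * math.log(n)

# name: (reported -2LL, ASSUMED parameter count k, reported BIC)
models = {
    "A  intercept only":             (4366.2, 3, 4389.6),   # 1 fixed + 1 var + 1 resid
    "B  linear":                     (4084.8, 6, 4131.6),   # 2 fixed + 3 (co)var + 1 resid
    "C  quadratic":                  (3914.7, 10, 3992.7),  # 3 fixed + 6 (co)var + 1 resid
    "D  cubic":                      (4231.6, 15, 4348.6),  # 4 fixed + 10 (co)var + 1 resid
    "C1 fixed int., fixed slope":    (6672.3, 4, 6703.5),   # 3 fixed + 1 resid
    "C2 fixed int., random slope":   (4627.7, 7, 4682.3),   # 3 fixed + 3 (co)var + 1 resid
    "C3 random int., fixed slope":   (4009.1, 5, 4048.1),   # 3 fixed + 1 var + 1 resid
    "C4 random int., random slope":  (3914.7, 10, 3992.7),  # same model as C
}

# Recompute BIC for each model and compare against the tabled value.
recomputed = {name: bic(ll, k, n_obs) for name, (ll, k, _) in models.items()}
for name, (ll, k, reported) in models.items():
    print(f"{name:32s} k={k:2d}  BIC={recomputed[name]:8.1f}  (reported {reported})")

# The model with the lowest BIC is preferred.
best = min(recomputed, key=recomputed.get)
print("Lowest BIC:", best)
```

Under these assumed parameter counts the recomputed values agree with the tabled BICs to within rounding, and the quadratic model with random intercepts and slopes (Model C, identical to Model C4) has the lowest BIC in both tables.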
Within-session reliability
Within-session reliability was assessed using Cronbach’s alpha and split-half correlations. Cronbach’s alpha (using single-trial accuracy as items) yielded an average within-session reliability of α = .76 (SD = .04, Min. = .65, Max. = .83). Equivalently, split-half correlations on K scores calculated from even versus odd trials revealed an average Spearman-Brown corrected reliability of r = .76 (SD = .05, Min. = .62, Max. = .84). As in Experiment 1, using raw error (Cronbach’s alpha) versus bias-adjusted capacity measures (Cowan’s K) did not affect reliability estimates. Within-session reliability increased slightly over time (Figure 5). Cronbach’s alpha values were positively correlated with session number (1-31), r = .82, p < .001, 95% CI [.66, .91], as were split-half correlation values, r = .67, p < .001, 95% CI [.41, .83].
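
The two within-session reliability indices described above can be sketched as follows. This is a minimal illustration on simulated data, not the analysis code used for the experiments: the function names, the simulated accuracy matrix, and the use of mean accuracy (rather than Cowan’s K) for the two halves are all assumptions made to keep the example short.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_items) matrix,
    e.g. 0/1 accuracy on each trial."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of subjects' total scores
    return k / (k - 1) * (1 - sum_item_var / total_var)

def split_half_spearman_brown(items):
    """Correlate scores from odd- and even-numbered trials, then apply the
    Spearman-Brown correction for halving the test length."""
    items = np.asarray(items, dtype=float)
    odd = items[:, 0::2].mean(axis=1)    # trials 1, 3, 5, ...
    even = items[:, 1::2].mean(axis=1)   # trials 2, 4, 6, ...
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

# Hypothetical data: 79 subjects x 120 trials; each subject has a fixed
# probability of answering correctly (a stand-in for individual capacity).
rng = np.random.default_rng(0)
p_correct = rng.uniform(0.55, 0.95, size=79)
acc = (rng.random((79, 120)) < p_correct[:, None]).astype(int)

alpha = cronbach_alpha(acc)
sb_r = split_half_spearman_brown(acc)
print(f"Cronbach's alpha = {alpha:.2f}, Spearman-Brown r = {sb_r:.2f}")
```

To mirror the split-half analysis on K scores, the per-half mean accuracy would be replaced with a capacity estimate computed separately from the odd and even trials.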
Figure 5. Change in within-session reliability across sessions in Experiment 2. There was a significant positive relationship between session number (1:31) and internal reliability.
Between-session stability
We first assessed stability over time by computing correlation coefficients for all pairwise combinations of sessions (465 total combinations). Missing sessions were excluded from the correlations, meaning that some pairwise correlations included 78 subjects instead of 79 (see Table 2). All sessions correlated with each other, mean r = .71 (SD = .06, Min. = .48, Max. = .86, all p-values < .001). A heat map of all pairwise correlations is shown in Figure 6. The most temporally distant sessions still correlated with each other. The correlation between Day 1 and Day 30 (28 intervening sessions) was r = .53, p < .001, 95% CI [.35, .67]; the correlation between Day 30 and Day 60 (0 intervening sessions) was r = .81, p < .001, 95% CI [.72, .88]; the correlation between Day 1 and Day 60 was r = .59, p < .001, 95% CI [.41, .71]. Finally, we observed that between-session stability increased over time, likely due to increased internal reliability across sessions. To compute change in reliability over time, we calculated the correlation coefficient for temporally adjacent sessions (e.g. the correlation of session 1 and session 2, of session 2 and session 3, etc.). The average adjacent-session correlation was r = .76 (SD = .05, Min. = .64, Max. = .86), and the strength of adjacent-session correlations was positively correlated with session number, r = .68, p < .001, indicating an increase in stability over time.
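
The pairwise and adjacent-session correlations described above can be sketched in a few lines. The capacity matrix below is simulated, the helper names are hypothetical, and the Fisher z interval is one common way to obtain a confidence interval for r; the text does not state which method was used for the reported intervals.

```python
import numpy as np
from itertools import combinations

# Hypothetical data: 79 subjects x 31 sessions of capacity estimates (K),
# built as a stable subject-level trait plus session-level noise.
rng = np.random.default_rng(1)
n_subjects, n_sessions = 79, 31
trait = rng.normal(2.8, 0.8, size=n_subjects)
k_scores = trait[:, None] + rng.normal(0, 0.5, size=(n_subjects, n_sessions))
k_scores[3, 3] = np.nan  # one missing session, to illustrate pairwise exclusion

def pairwise_r(x, y):
    """Pearson r using only subjects with data in both sessions.
    Returns (r, number of subjects used)."""
    ok = ~np.isnan(x) & ~np.isnan(y)
    return np.corrcoef(x[ok], y[ok])[0, 1], int(ok.sum())

# All unique session pairs (31 choose 2 = 465).
pairs = list(combinations(range(n_sessions), 2))
all_r = np.array([pairwise_r(k_scores[:, i], k_scores[:, j])[0] for i, j in pairs])

# Temporally adjacent sessions: (1, 2), (2, 3), ..., (30, 31).
adjacent_r = np.array([pairwise_r(k_scores[:, i], k_scores[:, i + 1])[0]
                       for i in range(n_sessions - 1)])

# Does stability change across sessions? Correlate adjacent-session r
# with the index of the earlier session in each pair.
trend = np.corrcoef(np.arange(1, n_sessions), adjacent_r)[0, 1]

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher z transform."""
    z = np.arctanh(r)
    half_width = z_crit / np.sqrt(n - 3)
    return np.tanh(z - half_width), np.tanh(z + half_width)

print(f"{len(pairs)} pairs, mean r = {all_r.mean():.2f}")
print(f"mean adjacent r = {adjacent_r.mean():.2f}, trend r = {trend:.2f}")
print("CI for r = .53, n = 79:", fisher_ci(0.53, 79))
```

Applied to r = .53 with n = 79, the Fisher interval comes out close to the [.35, .67] reported for Day 1 versus Day 30, which suggests, but does not confirm, that a similar method was used.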
Figure 6. Correlations between sessions. Left: Correlations between all possible pairs of sessions. Color represents the correlation coefficient of the capacity estimates from each possible pairwise combination of the 31 sessions. All correlation values were significant, p < .001. Right: Illustration of the sessions that are most distant in time: Day 1 correlated with Day 30 (28 intervening sessions) and Day 30 correlated with Day 60 (no intervening sessions).
Differences by testing location
We tested for systematic differences in performance, reliability, and stability for sessions completed at home versus in the lab. In total, there were 41 subjects who completed all of their sessions in their own home (“home group”), 27 subjects who completed all of their sessions in the lab (“lab group”), and 11 subjects who completed some sessions at home and some in the lab (“mixed group”).
Across all 31 sessions, subjects in the home group had an average capacity of 2.67 (SD = 1.01), those in the lab group had an average capacity of 3.01 (SD = .83), and those in the mixed group had an average capacity of 2.98 (SD = 1.04). On average, scores for sessions in the home group were slightly lower than scores for sessions in the lab group, t(2101) = -7.98, p < .001, 95% CI [-.42, -.25]. Scores for sessions in the mixed group were higher than for sessions in the home group, t(1606) = 5.0, p < .001, 95% CI [.19, .43], but were not different from the lab group, t(1175) = .44, p = .67, 95% CI [-.09, .14]. Interestingly, however, a paired t-test for the mixed group (n = 11) revealed that the same subjects performed slightly better in the lab (M = 3.08) and slightly worse at home (M = 2.85), t(10) = 3.15, p = .01, 95% CI [.07, .39].
Cronbach’s alpha estimates of within-session reliability were slightly higher for sessions completed at home (Mean α = .76, SD = .05) compared to sessions completed in the lab (Mean α = .69, SD = .058), t(60) = 3.75, p < .001, 95% CI [.03, .10]. Likewise, Spearman-Brown corrected correlation coefficients were higher for sessions completed at home (Mean r = .79, SD = .07) compared to in the lab (Mean r = .67, SD = .14), t(60) = 4.42, p < .001, 95% CI [.07, .18]. However, these differences in reliability may result from (1) unequal sample sizes between lab and home, (2) unequal average capacity between groups, or (3) unequal variability between groups. After equating sample size between groups and matching samples for average capacity, differences in reliability were no longer stable: Across iterations of matched samples, the significance of differences in Cronbach’s α ranged from p < .01 to p > .5, and the significance of differences in split-half correlations ranged from p < .01 to p > .25.
Next, we examined differences in stability for sessions completed at home compared to in the lab. On average, test-retest correlations were higher for home sessions (Mean r = .72, SD = .08) compared to lab sessions (Mean r = .67, SD = .10), t(928) = 8.01, p < .001, 95% CI [.04, .06]. Again, however, differences in test-retest correlations were not reliable after matching sample size and average capacity; the significance of differences in correlations ranged from p = .01 to p = .98.
Discussion
With extensive practice over multiple sessions, we observed improvement in overall change detection performance. This improvement was most pronounced over early sessions, after which mean performance stabilized for the remaining sessions. The internal reliability of the first session (Spearman-Brown corrected r = .71, Cronbach’s α = .67) was within the range predicted by the look-up table created in Experiment 1 for 80 subjects and 120 trials (predicted range: r = .61 to .87 and α = .58 to .80, respectively). Both reliability and stability remained high over the span of 60 days. In fact, reliability and stability increased slightly across sessions. An important consideration for any cognitive measure is whether or not repeated exposure to the task will harm the reliability of the measure. For example, re-exposure to the same logic puzzles will drastically reduce the amount of time needed to solve the puzzles and inflate accuracy. Thus, for such tasks great care must be taken to generate novel test versions to be administered at different dates. Similarly, over-practice effects could lead to a sharp decrease in variability of performance (e.g. ceiling effects, floor effects), which would by definition lead to a decrease in reliability. Here, we demonstrated that while capacity estimates increase when subjects are frequently exposed to a change detection task, the reliability of the measure is not compromised by practice effects or ceiling effects.
We also examined whether reliability was harmed for participants who completed the change detection sessions in their own homes compared to the lab. While remote data collection sacrifices some degree of experimental control, the use of at-home tests is becoming more common with the ease of remote data collection through resources like Amazon’s Mechanical Turk (Mason & Suri, 2012). Reliability was not noticeably disrupted by noise arising from small differences in stimulus size between different testing environments. After controlling for number of subjects and capacity, there was no longer a consistent difference in reliability or stability for sessions completed at home compared to in the lab. However, capacity estimates obtained in subjects’ homes were significantly lower than those obtained in the lab. Larger sample sizes are needed to more fully investigate systematic differences in capacity and reliability between testing environments.
General Discussion
In Experiment 1, we developed a novel approach for estimating expected reliability in future experiments. We collected change detection data from a large number of subjects and trials, and then we used an iterative downsampling procedure to investigate the effect of sample size and trial number on reliability. Average reliability across iterations was fairly impervious to the number of subjects. Instead, average reliability estimates across iterations relied more heavily on the number of trials per subject. By contrast, the variability of reliability estimates across iterations was highly sensitive to the number of subjects. For example, with only 10 subjects, the average reliability estimate for an experiment with 150 trials was high (α = .75) but the worst iteration (akin to the worst expected experiment out of 100) gave a poor reliability estimate (α = .42). However, the range between the best and worst reliability estimates decreased dramatically as the number of subjects increased. With 40 subjects, the minimum observed reliability for 150 trials was α = .65.
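
The logic of an iterative downsampling procedure of this kind can be sketched as follows. This is an illustration on simulated data under stated assumptions, not the procedure as implemented for Experiment 1: the function names, the simulated 135 x 540 accuracy matrix, sampling without replacement, and ignoring set-size when drawing trials are all choices made here for brevity.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_items) matrix of trial scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - sum_item_var / total_var)

def downsample_alpha(acc, n_subjects, n_trials, n_iter=100, seed=0):
    """Repeatedly draw a random subset of subjects and trials from the full
    data set and compute alpha for each simulated 'small experiment'."""
    rng = np.random.default_rng(seed)
    alphas = np.empty(n_iter)
    for i in range(n_iter):
        rows = rng.choice(acc.shape[0], size=n_subjects, replace=False)
        cols = rng.choice(acc.shape[1], size=n_trials, replace=False)
        alphas[i] = cronbach_alpha(acc[np.ix_(rows, cols)])
    return alphas

# Hypothetical full data set: 135 subjects x 540 trials of 0/1 accuracy.
rng = np.random.default_rng(42)
p_correct = rng.uniform(0.6, 0.9, size=135)
acc_full = (rng.random((135, 540)) < p_correct[:, None]).astype(int)

# Same trial count, different sample sizes.
a10 = downsample_alpha(acc_full, n_subjects=10, n_trials=150)
a40 = downsample_alpha(acc_full, n_subjects=40, n_trials=150)

for label, a in (("10 subjects", a10), ("40 subjects", a40)):
    print(f"{label}: mean = {a.mean():.2f}, min = {a.min():.2f}, max = {a.max():.2f}")
```

Repeating this over a grid of subject counts and trial counts, and storing the mean, minimum, and maximum of each cell, would yield a look-up table of the kind described for Experiment 1. The simulated values are not expected to match the reported ones.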
In Experiment 2, we examined the reliability and stability of change detection capacity estimates across an unprecedented number of testing sessions. Subjects completed 31 sessions of single-probe change detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later (Day 60). Average internal reliability for the first session was in the range predicted by the look-up table in Experiment 1. Despite improvements in performance across sessions, between-subject variability in K remained stable over time (average test-retest between all 31 sessions was r = .76; the correlation for the two most distant sessions, Day 1 and Day 60, was r = .59). Interestingly, both within-session reliability and between-session reliability increased across sessions. Rather than diminishing due to practice, reliability of WMC estimates increased across many sessions.
The present work has implications for planning studies with novel measures and for justifying the inclusion of existing measures into clinical batteries such as the Research Domain Criteria (RDoC) project (Cuthbert & Kozak, 2013; Rodebaugh et al., 2016). For basic research, an internal reliability of 0.7 is considered a sufficient “rule of thumb” for investigating correlational relationships between measures (Nunnally, 1978). While this level of reliability (or even lower) will allow researchers to detect correlations, it is not sufficient to confidently assess the scores of individuals. For that, reliability in excess of .9 or even .95 is desirable (Nunnally, 1978). Here, we demonstrate how the number of trials can alter the reliability of working memory capacity estimates; with relatively few trials (~150, around 10 minutes of task time), change detection estimates are sufficiently reliable for correlation studies (α ~ .8), but many more trials are needed (~500) to boost reliability to the level needed to assess individuals (α ~ .9). Another important consideration for a diagnostic measure is its reliability across multiple testing sessions. Some tasks lose their diagnostic value once individuals have been exposed to them once or twice. Here we demonstrate that change detection estimates of working memory capacity are stable, even when participants are well-practiced on the task (3,720 trials over 31 sessions).
One challenge in estimating the “true” reliability of a cognitive task is that reliability depends heavily on sample characteristics. As we have demonstrated, varying the sample size and number of trials can yield very different estimates of the reliability for an identical task. Other sample characteristics can likewise affect reliability; the most notable of these is sample homogeneity. The sample used here was a large sample of university students, with a fairly wide range in capacities (approximately 0.5–4 items). Samples using only a subset of this capacity range (e.g. clinical patient groups with very low capacity) will be less internally reliable because of the restricted range of the sub-population. Indeed, in Experiment 1 we found that sampling iterations with poor reliability tended to have lower variability and a smaller range of scores. Thus, carefully recording sample size, mean, standard deviation, and internal reliability in all experiments will be critical for assessing and improving the reliability of standardized tasks used for cognitive research. In the interest of replicability, open source code repositories (e.g. the Experiment Factory) have sought to make standardized versions of common cognitive tasks better-categorized, open, and easily available (Sochat et al., 2016). However, one potential weakness for task repositories is a lack of documentation about expected internal reliability. Standardization of tasks can be very useful, but it should not be over-applied. In particular, experiments with different goals should use different test lengths that best suit the goals of the experimental question. We feel that projects such as the Experiment Factory will certainly lead to more replicable science, and including estimates of reliability with task code could help to further this goal.
Finally, the results presented here have implications for researchers who are interested in differences between experimental conditions and not individual differences per se. Trial number and sample size will affect the degree of measurement error for each condition used within change detection experiments (e.g. set-sizes, distractor presence, etc.). To detect significant differences between conditions and avoid false positives, it would be desirable to estimate the number of trials needed to ensure adequate internal reliability for each condition of interest within the experiment. Insufficient trial numbers or sample sizes can lead to intolerably low internal reliability, and could spoil an otherwise well-planned experiment.
The results of Experiments 1 and 2 revealed that change detection estimates of visual working memory capacity are both internally reliable and stable across many testing sessions. This finding is consistent with previous studies showing that other measures of working memory capacity are reliable and stable, including complex span measures (Beckmann, Holling, & Kuhn, 2007; Foster et al., 2015; Klein & Fiss, 1999; Waters & Caplan, 1996) and the visuospatial n-back (Hockey & Geffen, 2004). The main analyses from Experiment 1 suggest concrete guidelines for designing studies that require reliable estimates of change detection capacity. When both sample size and trial numbers were high, the reliability of change detection was quite high (α > .9). However, studies with insufficient sample sizes or number of trials frequently had low internal reliability. Consistent with the notion that working memory capacity is a stable trait of the individual, individual differences in capacity remained stable over many sessions in Experiment 2 despite practice-related performance increases.
The effects of both trial number and sample size are important to consider, and researchers should be cautious about generalizing expected reliability across vastly different sample sizes. For example, in a recent paper by Foster and colleagues (2015), the authors found that cutting the number of complex span trials by two-thirds had only a modest effect on the strength of the correlation between working memory capacity and fluid intelligence. Critically, however, the authors used around 500 subjects, and such a large sample size will act as a buffer against increases in measurement error (i.e. fewer trials per subject). Readers wishing to conduct a new study with a smaller sample size (e.g. 50 subjects) would be ill-advised to dramatically cut trial numbers based on this finding alone; as demonstrated in Experiment 1, cutting trial numbers leads to greater volatility of reliability values for small sample sizes relative to large ones. Given present concerns about power and replicability in psychological research (Open Science Collaboration, 2015), we suggest that rigorous estimation of task reliability, considering both subject and trial numbers, will be useful for planning both new studies and replication efforts.
References
Beckmann, B., Holling, H., & Kuhn, J.-T. (2007). Reliability of verbal–numerical working memory tasks. Personality and Individual Differences, 43(4), 703–714. https://doi.org/10.1016/j.paid.2007.01.011
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 1904-1920, 3(3), 296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x
Buschman, T. J., Siegel, M., Roy, J. E., & Miller, E. K. (2011). Neural substrates of cognitive capacity limitations. Proceedings of the National Academy of Sciences, 108(27), 11252–11255. https://doi.org/10.1073/pnas.1104666108
Cowan, N. (2001). The magical number 4 in short-term memory: a reconsideration of mental storage capacity. The Behavioral and Brain Sciences, 24(1), 87-114-185. https://doi.org/10.1017/S0140525X01003922
Cowan, N., Fristoe, N. M., Elliott, E. M., Brunner, R. P., & Saults, J. S. (2006). Scope of attention, control of attention, and intelligence in children and adults. Memory & Cognition, 34(8), 1754–1768. https://doi.org/10.3758/BF03195936
Cramer, D. (1997). Basic statistics for social research: step-by-step calculations and computer techniques using Minitab. London; New York: Routledge.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555
Cuthbert, B. N., & Kozak, M. J. (2013). Constructing constructs for psychopathology: The NIMH research domain criteria. Journal of Abnormal Psychology, 122(3), 928–937. https://doi.org/10.1037/a0034028
Elmore, L. C., Magnotti, J. F., Katz, J. S., & Wright, A. A. (2012). Change detection by rhesus monkeys (Macaca mulatta) and pigeons (Columba livia). Journal of Comparative Psychology, 126(3), 203–212. https://doi.org/10.1037/a0026356
Engle, R. W., Tuholski, S. W., Laughlin, J. E., & Conway, A. R. (1999). Working memory, short-term memory, and general fluid intelligence: a latent-variable approach. Journal of Experimental Psychology. General, 128(3), 309–331.
Foster, J. L., Shipstead, Z., Harrison, T. L., Hicks, K. L., Redick, T. S., & Engle, R. W. (2015). Shortened complex span tasks can reliably measure working memory capacity. Memory & Cognition, 43(2), 226–236. https://doi.org/10.3758/s13421-014-0461-7
Fukuda, K., Vogel, E., Mayr, U., & Awh, E. (2010). Quantity, not quality: the relationship between fluid intelligence and working memory capacity. Psychonomic Bulletin & Review, 17(5), 673–679. https://doi.org/10.3758/17.5.673
Fukuda, K., Woodman, G. F., & Vogel, E. K. (2015). Individual Differences in Visual Working Memory Capacity: Contributions of Attentional Control to Storage. In P. Jolicoeur, C. Lefebvre, & J. Martinez-Trujillo (Eds.), Mechanisms of Sensory Working Memory: Attention and Performance XXV (pp. 105–120). Elsevier. Retrieved from http://linkinghub.elsevier.com/retrieve/pii/B9780128013717000090
Gibson, B., Wasserman, E., & Luck, S. J. (2011). Qualitative similarities in the visual short-term memory of pigeons and people. Psychonomic Bulletin & Review, 18(5), 979–984. https://doi.org/10.3758/s13423-011-0132-7
Gold, J. M., Wilk, C. M., McMahon, R. P., Buchanan, R. W., & Luck, S. J. (2003). Working memory for visual features and conjunctions in schizophrenia. Journal of Abnormal Psychology, 112(1), 61–71. https://doi.org/10.1037/0021-843X.112.1.61
Hockey, A., & Geffen, G. (2004). The concurrent validity and test–retest reliability of a visuospatial working memory task. Intelligence, 32(6), 591–605. https://doi.org/10.1016/j.intell.2004.07.009
Johnson, M. K., McMahon, R. P., Robinson, B. M., Harvey, A. N., Hahn, B., Leonard, C. J., … Gold, J. M. (2013). The relationship between working memory capacity and broad measures of cognitive ability in healthy adults and people with schizophrenia. Neuropsychology, 27(2), 220–229. https://doi.org/10.1037/a0032060
Klein, K., & Fiss, W. H. (1999). The reliability and stability of the Turner and Engle working memory task. Behavior Research Methods, Instruments, & Computers: A Journal of the Psychonomic Society, Inc, 31(3), 429–432.
Lee, E.-Y., Cowan, N., Vogel, E. K., Rolan, T., Valle-Inclan, F., & Hackley, S. A. (2010). Visual working memory deficits in patients with Parkinson’s disease are due to both reduced storage capacity and impaired ability to filter out irrelevant information. Brain, 133(9), 2677–2689. https://doi.org/10.1093/brain/awq197
Luria, R., Balaban, H., Awh, E., & Vogel, E. K. (2016). The contralateral delay activity as a neural measure of visual working memory. Neuroscience & Biobehavioral Reviews, 62, 100–108. https://doi.org/10.1016/j.neubiorev.2016.01.003
Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods, 44(1), 1–23. https://doi.org/10.3758/s13428-011-0124-6
Melby-Lervåg, M., & Hulme, C. (2013). Is working memory training effective? A meta-analytic review. Developmental Psychology, 49(2), 270–291. https://doi.org/10.1037/a0028228
Nunnally, J. C. (1978). Psychometric theory (2d ed). New York: McGraw-Hill.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716-aac4716. https://doi.org/10.1126/science.aac4716
Pailian, H., & Halberda, J. (2015). The reliability and internal consistency of one-shot and flicker change detection for measuring individual differences in visual working memory capacity. Memory & Cognition, 43(3), 397–420. https://doi.org/10.3758/s13421-014-0492-0
Pashler, H. (1988). Familiarity and visual change detection. Perception & Psychophysics, 44(4), 369–378. https://doi.org/10.3758/BF03210419
Reinhart, R. M. G., Heitz, R. P., Purcell, B. A., Weigand, P. K., Schall, J. D., & Woodman, G. F. (2012). Homologous Mechanisms of Visuospatial Working Memory Maintenance in Macaque and Human: Properties and Sources. Journal of Neuroscience, 32(22), 7711–7722. https://doi.org/10.1523/JNEUROSCI.0215-12.2012
Rodebaugh, T. L., Scullin, R. B., Langer, J. K., Dixon, D. J., Huppert, J. D., Bernstein, A., … Lenze, E. J. (2016). Unreliability as a Threat to Understanding Psychopathology: The Cautionary Tale of Attentional Bias. Journal of Abnormal Psychology. https://doi.org/10.1037/abn0000184
Rouder, J. N. (n.d.). Applications and Source Code. Retrieved June 22, 2016, from http://pcl.missouri.edu/apps
Rouder, J. N., Morey, R. D., Cowan, N., Zwilling, C. E., Morey, C. C., & Pratte, M. S. (2008). An assessment of fixed-capacity models of visual working memory. Proceedings of the National Academy of Sciences of the United States of America, 105(16), 5975–5979. https://doi.org/10.1073/pnas.0711295105
Rouder, J. N., Morey, R. D., Morey, C. C., & Cowan, N. (2011). How to measure working memory capacity in the change detection paradigm. Psychonomic Bulletin & Review, 18(2), 324–330. https://doi.org/10.3758/s13423-011-0055-3
Shipstead, Z., Redick, T. S., & Engle, R. W. (2012). Is working memory training effective? Psychological Bulletin, 138(4), 628–654. https://doi.org/10.1037/a0027473
Sochat, V. V., Eisenberg, I. W., Enkavi, A. Z., Li, J., Bissett, P. G., & Poldrack, R. A. (2016). The Experiment Factory: Standardizing Behavioral Experiments. Frontiers in Psychology, 7. https://doi.org/10.3389/fpsyg.2016.00610
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 1904-1920, 3(3), 271–295. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x
Todd, J. J., & Marois, R. (2004). Capacity limit of visual short-term memory in human posterior parietal cortex. Nature, 428(6984), 751–754. https://doi.org/10.1038/nature02466
Unsworth, N., Fukuda, K., Awh, E., & Vogel, E. K. (2014). Working memory and fluid intelligence: Capacity, attention control, and secondary memory retrieval. Cognitive Psychology, 71, 1–26. https://doi.org/10.1016/j.cogpsych.2014.01.003
Van Snellenberg, J. X., Conway, A. R. A., Spicer, J., Read, C., & Smith, E. E. (2014). Capacity estimates in working memory: Reliability and interrelationships among tasks. Cognitive, Affective, & Behavioral Neuroscience, 14(1), 106–116. https://doi.org/10.3758/s13415-013-0235-x
Vogel, E. K., & Machizawa, M. G. (2004). Neural activity predicts individual differences in visual working memory capacity. Nature, 428(6984), 748–751. https://doi.org/10.1038/nature02447
Waters, G. S., & Caplan, D. (1996). The measurement of verbal working memory capacity and its relation to reading comprehension. The Quarterly Journal of Experimental Psychology. A, Human Experimental Psychology, 49(1), 51–75. https://doi.org/10.1080/713755607
Wood, G., Hartley, G., Furley, P. A., & Wilson, M. R. (2016). Working Memory Capacity, Visual Attention and Hazard Perception in Driving. Journal of Applied Research in Memory and Cognition. https://doi.org/10.1016/j.jarmac.2016.04.009