in press at behavior research methods doi : 10.3758/s13428-017-0886-6 · 2017. 3. 6. · doi :...

Running head: CHANGE DETECTION RELIABILITY

IN PRESS AT BEHAVIOR RESEARCH METHODS

DOI: 10.3758/s13428-017-0886-6

The reliability and stability of visual working memory capacity

Xu, Z.1*, Adam, K. C. S.2*, Fang, X.1, & Vogel, E. K.2

1 School of Psychology, Southwest University, Chongqing, China
2 Department of Psychology, University of Chicago, Chicago, IL

* These authors contributed equally to the work.

Word Count: 7151
Figures: 6
Tables: 4
Keywords: visual working memory, reliability, change detection

Contributions: Z.X. and E.V. designed the experiments; Z.X. and X.F. collected data. K.A. performed analyses and drafted the manuscript. K.A., Z.X., and E.V. revised the manuscript.

Acknowledgements: Research was supported by the Project of Humanities and Social Sciences, Ministry of Education, China (15YJA190008), the Fundamental Research Funds for the Central Universities (SWU1309117), NIH grant 2R01MH087214-06A1 and Office of Naval Research grant N00014-12-1-0972. Datasets for all experiments are available online on Open Science Framework at https://osf.io/g7txf/.

Conflicts of Interest: none

Correspondence to: Kirsten C. S. Adam, University of Chicago, 940 E 57th St, Chicago, IL 60637, +1 (773) [email protected]


Abstract

Because of the central role of working memory capacity in cognition, many studies have used short measures of working memory capacity to examine its relationship to other domains. Here, we measured the reliability and stability of visual working memory capacity as assessed with a single-probe change detection task. In Experiment 1, subjects (N = 135) completed a large number of trials of a change detection task (540 in total, 180 each of set-sizes 4, 6, and 8). With large numbers of trials and subjects, reliability estimates were high (α > .9). We then used an iterative downsampling procedure to create a look-up table for expected reliability in experiments with small sample sizes. In Experiment 2, subjects (N = 79) completed 31 sessions of single-probe change detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later. This unprecedented number of sessions allowed us to examine the effects of practice on stability and internal reliability. Even after much practice, individual differences were stable over time (average between-session r = .76).



Working Memory Capacity (WMC) is a core cognitive ability that predicts performance across many domains. For example, WMC predicts attentional control, fluid intelligence and real-world outcomes such as perceiving hazards while driving (Engle, Tuholski, Laughlin, & Conway, 1999; Fukuda, Vogel, Mayr, & Awh, 2010; Wood, Hartley, Furley, & Wilson, 2016). As such, researchers are often interested in devising brief measures of WMC to investigate the relationship of WMC to other cognitive processes. However, truncated versions of WMC tasks could be inadequate for reliably measuring an individual's capacity. Inadequate measurement could obscure correlations between measures or even differences in performance between experimental conditions. Furthermore, while WMC is considered to be a stable trait of the observer, little work has directly examined the effect of extensive practice on the measurement of WMC over time. This is of particular concern due to the popularity of research examining whether training affects WMC (Melby-Lervåg & Hulme, 2013; Shipstead, Redick, & Engle, 2012). Extensive practice on any given cognitive task has the potential to significantly alter the nature of the variance that determines performance. For example, extensive practice could induce a restriction of range problem, in which the bulk of the observers reach similar performance levels, thus reducing any opportunity to observe correlations with other measures. Consequently, a systematic study of the reliability and stability of WMC measures is critical for improving the measurement and reproducibility of major phenomena in this field.

In the present study, we seek to establish the reliability and stability of one particular WMC measure: change detection. Change detection measures of visual working memory have gained popularity as a means of assessing individual differences in capacity. In a typical change detection task, participants briefly view an array of simple visual items (~100 to 500 ms), such as colored squares, and remember these items across a short delay (~1 to 2 seconds). At test, observers are presented with an item at one of the remembered locations, and they indicate whether the presented test item is the same as the remembered item (“no-change” trial) or is different (“change” trial). Performance can be quantified as raw accuracy or converted into a capacity estimate (“K”). In capacity estimates, performance for change trials and no-change trials is calculated separately as hits (proportion of correct change trials) and false alarms (proportion of incorrect no-change trials) and converted into a set-size dependent score (Cowan, 2001; Pashler, 1988; Rouder, Morey, Morey, & Cowan, 2011).

There are several beneficial features of change detection tasks that have led to their increased popularity. First, change detection memory tasks are simple and short enough to be used with developmental and clinical populations (e.g., Cowan, Fristoe, Elliott, Brunner, & Saults, 2006; Gold, Wilk, McMahon, Buchanan, & Luck, 2003; Lee et al., 2010). Second, the relatively short length of trials lends the task well to neural measures that require large numbers of trials. In particular, neural studies employing change detection tasks have provided strong corroborating evidence of capacity limits in WM (Todd & Marois, 2004; Vogel & Machizawa, 2004), and have yielded insights into potential mechanisms underlying individual differences in working memory capacity (for review, see Luria, Balaban, Awh, & Vogel, 2016). Finally, change detection tasks and closely related memory-guided saccade tasks can be used with animal models from pigeons (Gibson, Wasserman, & Luck, 2011) to non-human primates (Buschman, Siegel, Roy, & Miller, 2011), providing a rare opportunity to directly compare behavior and neural correlates of task performance across species (Elmore, Magnotti, Katz, & Wright, 2012; Reinhart et al., 2012).


A main aim of this study is to quantify the effect of measurement error and sample size on the reliability of change detection estimates. In previous studies, change detection estimates of capacity have yielded good reliability estimates (e.g., Pailian & Halberda, 2015; Unsworth et al., 2014). However, measurement error can vary dramatically with the number of trials in a task, thus impacting reliability; Pailian and Halberda (2015) found that the reliability of change detection estimates greatly improved when the number of trials was increased. Researchers frequently employ vastly different numbers of trials and subjects in studies of individual differences, but the effect of trial number on change detection reliability has never been fully characterized. In studies using large batteries of tasks, time and measurement error are forces working in opposition to one another. When researchers want to minimize the amount of time that a task takes, measures are often truncated to expedite administration. Such truncated measures increase measurement noise and potentially harm the reliability of the measure. At present, there is no clear understanding of the minimum numbers of subjects and trials necessary to obtain reliable estimates of change detection capacity.

In addition to measurement error within a session, the reliability of individual differences could be compromised by extensive practice. Previously, it was found that visual working memory capacity estimates were stable (r = .77) after 1.5 years between testing sessions (Johnson et al., 2013). However, the effect of extensive practice on change detection estimates of capacity has yet to be characterized. Extensive practice could harm the reliability and stability of measures in a couple of ways. First, participants could improve so much that they reach performance ceiling, thus eliminating variability between individuals. Second, if individual differences are due to the use of optimal versus sub-optimal strategies, then participants might converge to a common mean after engaging in extensive practice and finding optimal task strategies. Both of these hypothetical possibilities would call into question the true stability of working memory capacity estimates, and likewise severely harm the statistical reliability of the measure. As such, in Experiment 2 we directly quantified the effect of extensive practice on the stability of working memory capacity estimates.

Overview of Experiments

We measured the reliability and stability of a single-probe change detection measure of visual working memory capacity. In Experiment 1, we measured the reliability of capacity estimates obtained with a commonly used version of the color change detection task for a relatively large number of subjects (n = 137) and a larger than typical number of trials (t = 540). In Experiment 2, we measured the stability of capacity estimates across an unprecedented number of testing sessions (31). Because of the large number of sessions, we could investigate the stability of change detection estimates after extended practice and over a period of 60 days.

Experiment 1

Materials and Methods

Participants

A total of 137 individuals (35 males; mean age = 19.97, SD = 1.07) with normal or corrected-to-normal vision participated in the experiment. Participants provided written informed consent, and the study was approved by the Ethics Committee at Southwest University. Participants received monetary compensation for their participation. Two participants were excluded because they had negative average capacity values, resulting in a final sample of 135 subjects.

Stimuli

Stimuli were presented on monitors with a refresh rate of 75 Hz and a screen resolution of 1024 × 768. Participants sat approximately 60 cm from the screen, though a chin rest was not used, so all visual angle estimates are approximate. In addition, monitor size varied slightly across testing rooms (five 16” CRT monitors, three 19” LCD monitors), leading to small variations in the size of the colored squares from monitor to monitor. We therefore report approximate ranges in degrees of visual angle.

All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using the Psychophysics Toolbox. Colored squares (51 pixels; range of 1.55° to 2.0° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 10.3° to 13.35° horizontally and 7.9° to 9.8° vertically. Squares could appear in any of nine distinct colors, and colors were sampled without replacement within each trial (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Participants were instructed to fixate a small black dot (approximate range: .36° to .47° visual angle) at the center of the display.
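Because the 51-pixel squares were shown without a chin rest on monitors of different physical sizes, their visual angle varied from room to room. The standard relation is θ = 2·arctan(s / 2D) for an object of physical size s viewed at distance D. The following is a minimal Python sketch of that arithmetic. The screen widths in the example are hypothetical values, not measurements from this study, and the helper names are ours.

```python
import math


def pixel_pitch_cm(screen_width_cm, horizontal_pixels):
    """Physical width of one pixel, assuming square pixels filling the visible width."""
    return screen_width_cm / horizontal_pixels


def visual_angle_deg(size_cm, distance_cm):
    """Full visual angle (degrees) subtended by an object of size_cm at distance_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


def square_angle_deg(square_px, screen_width_cm, horizontal_pixels=1024, distance_cm=60.0):
    """Visual angle of a square_px-wide square on a given display and viewing distance."""
    size_cm = square_px * pixel_pitch_cm(screen_width_cm, horizontal_pixels)
    return visual_angle_deg(size_cm, distance_cm)


if __name__ == "__main__":
    # Hypothetical visible screen widths (cm); real values depend on each monitor.
    for width_cm in (30.0, 38.0):
        print(f"visible width {width_cm:.0f} cm -> {square_angle_deg(51, width_cm):.2f} deg")
```

A wider visible screen at the same resolution gives a larger pixel pitch and therefore a larger angle. Changing the viewing distance away from the nominal 60 cm shifts the angle in the opposite direction, which is why the reported values are ranges.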

Procedures

Each trial began with a blank fixation period of 1,000 ms. Then, participants briefly viewed an array of 4, 6, or 8 colored squares (150 ms), which they remembered across a blank delay period (1,000 ms). At test, one colored square was presented at one of the remembered locations. There was an equal probability that the probed square was the same color (no-change trial) or a different color (change trial). Participants made an unspeeded response by pressing the “z” key if the color was the same and the “/” key if the color was different. Participants completed 180 trials of each of set-sizes 4, 6, and 8 (540 trials total). Trials were divided into 9 blocks, and participants were given a brief rest period (30 seconds) after each block. To calculate capacity, change detection accuracy was transformed into a K estimate using Cowan's (2001) formula K = N × (H − FA), where N represents the set-size, H is the hit rate (proportion of correct change trials), and FA is the false alarm rate (proportion of incorrect no-change trials). Cowan's formula is best suited to single-probe displays like the one employed here. For change detection tasks using whole-display probes, Pashler's (1988) formula may be more appropriate (Rouder et al., 2011).
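The two capacity estimators can be written as small functions. This is a minimal Python sketch, assuming trial-level data coded as one boolean "change trial" flag and one boolean "responded different" flag per trial. The function names are ours and are not taken from the authors' MATLAB code. Pashler's formula appears here in its usual form, K = N × (H − FA) / (1 − FA).

```python
import numpy as np


def hit_fa_rates(is_change, said_change):
    """Return (hit rate, false-alarm rate) from per-trial boolean arrays.

    Hit rate: proportion of change trials answered "different".
    False-alarm rate: proportion of no-change trials answered "different"
    (i.e., the proportion of incorrect no-change trials).
    """
    is_change = np.asarray(is_change, dtype=bool)
    said_change = np.asarray(said_change, dtype=bool)
    return said_change[is_change].mean(), said_change[~is_change].mean()


def cowan_k(set_size, hit_rate, fa_rate):
    """Cowan's (2001) single-probe estimate: K = N * (H - FA)."""
    return set_size * (hit_rate - fa_rate)


def pashler_k(set_size, hit_rate, fa_rate):
    """Pashler's (1988) whole-display estimate: K = N * (H - FA) / (1 - FA)."""
    return set_size * (hit_rate - fa_rate) / (1.0 - fa_rate)


def average_k(rates_by_set_size):
    """Average Cowan's K over set-sizes; input maps set-size N -> (H, FA)."""
    return float(np.mean([cowan_k(n, h, fa) for n, (h, fa) in rates_by_set_size.items()]))
```

For example, N = 4, H = .80 and FA = .20 give Cowan's K = 4 × .60 = 2.4. Pashler's formula gives 2.4 / .80 = 3.0 for the same rates, which shows why the choice of formula should match the probe type.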

Results

Descriptive statistics for each set-size condition are shown in Table 1, and data for both Experiments 1 and 2 are available online on Open Science Framework at https://osf.io/g7txf/. There was a significant difference in performance across set-sizes, F(2, 268) = 20.6, p < .001, ηp² = .133, and polynomial contrasts revealed a significant linear trend, F(1, 134) = 36.48, p < .001, ηp² = .214, indicating that average performance declined slightly with increased memory load.



              Mean K   SD    Min    Max    Kurtosis   Skewness
Set-Size 4    2.32     .70   .58    3.87   -.49       -.34
Set-Size 6    2.10     .97   .07    4.80   -.18       .34
Set-Size 8    1.98     .97   -.18   4.53   -.52       -.14
Average       2.14     .82   .38    4.31   -.47       .07

Table 1. Descriptive statistics for Experiment 1. Descriptive statistics are shown separately for each set-size and for the average of the three set-sizes. Kurtosis and skewness values are both centered around 0. Neither kurtosis nor skewness was credibly non-normal in any condition (Cramer, 1997).


Reliability of the Full Sample: Cronbach's Alpha

We computed Cronbach's alpha (unstandardized) using K scores from the three set-sizes as items (180 trials contributing to each item), and obtained a value of α = .91 (Cronbach, 1951). We also computed Cronbach's alpha using K scores from the nine blocks of trials (60 trials contributing to each item) and obtained a nearly identical value of α = .92. Finally, we computed Cronbach's alpha using raw accuracy for single trials (540 items), and obtained an identical value of α = .92. Thus, change detection estimates had high internal reliability for this large sample of subjects, and the precise method used to divide trials into “items” did not impact Cronbach's alpha estimates of reliability for the full sample. Further, using raw accuracy versus bias-corrected K scores did not impact reliability.
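Cronbach's alpha compares the summed item variances with the variance of each subject's total score: α = k / (k − 1) × (1 − Σσ²ᵢ / σ²_total). The following is a minimal Python sketch using NumPy. The rows are subjects, and the columns can be any of the three "item" definitions used above: K per set-size (3 columns), K per block (9 columns) or 0/1 accuracy per trial (540 columns). The function name is ours.

```python
import numpy as np


def cronbach_alpha(items):
    """Unstandardized Cronbach's alpha for a (subjects x items) score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    x = np.asarray(items, dtype=float)
    if x.ndim != 2 or x.shape[1] < 2:
        raise ValueError("need a 2-D array with at least two items (columns)")
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1)        # variance of each item across subjects
    total_var = x.sum(axis=1).var(ddof=1)   # variance of subjects' summed scores
    return float(k / (k - 1) * (1.0 - item_var.sum() / total_var))
```

Because the same `ddof` is used in the numerator and the denominator, the choice of sample or population variance does not change the result.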

Reliability of the Full Sample: Split-half

The split-half correlation of the K scores for even and odd trials was high, r = .88, p < .001, 95% CI [.78, .88]. Applying the Spearman-Brown correction for the halved test length yielded a split-half reliability of r = .94 (Brown, 1910; Spearman, 1910). Likewise, the capacity scores from individual set-sizes correlated with each other: r(ss4-ss6) = .84, p < .001, 95% CI [.78, .88]; r(ss6-ss8) = .78, p < .001, 95% CI [.72, .85]; r(ss4-ss8) = .76, p < .001, 95% CI [.68, .83]. Split-half correlations for individual set-sizes yielded Spearman-Brown corrected correlation values of r = .91 for set-size 4, r = .86 for set-size 6, and r = .76 for set-size 8.
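The Spearman-Brown step predicts the reliability of a test lengthened by a factor L as L·r / (1 + (L − 1)·r), where L = 2 when two half-tests are recombined. The following is a minimal Python sketch, assuming the half scores are K estimates computed separately from the odd-numbered and even-numbered trials of each subject. The function names are ours.

```python
import numpy as np


def spearman_brown(r_half, length_factor=2.0):
    """Predicted reliability when test length is multiplied by length_factor (2 for split-half)."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


def split_half_reliability(scores_odd, scores_even):
    """Pearson r between half-test scores and its Spearman-Brown corrected value."""
    r = float(np.corrcoef(scores_odd, scores_even)[0, 1])
    return r, spearman_brown(r)
```

As a check against the values above, `spearman_brown(0.88)` returns about 0.936, which rounds to the reported .94.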

The drop in capacity from set-size 4 to set-size 8 has been used in the literature as a measure of filtering ability. However, the internal reliability of this difference score has typically been low (Pailian & Halberda, 2015; Unsworth et al., 2014). Likewise, we found that the split-half reliability of the performance decline from set-size 4 to set-size 8 (“4-8 Drop”) was low, with a Spearman-Brown corrected correlation value of r = .24. While weak, this correlation is the same strength as reported in earlier work (Unsworth et al., 2014). The split-half reliability of the performance decline from set-size 4 to set-size 6 was slightly higher, r = .39, and the split-half reliability of the difference between set-size 6 and set-size 8 performance was very low, r = .08. The reliability of difference scores can be affected both by (1) the internal reliability of each measure used to compute the difference and (2) the degree of correlation between the two measures (Rodebaugh et al., 2016). Although the internal reliability of each individual set-size was high, the positive correlation between set-sizes may have decreased the reliability of the set-size difference scores.
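Point (2) follows from a standard classical-test-theory expression for the reliability of a difference score D = X − Y. This expression is not an analysis reported in this study. The numerator is the true-score variance of D and the denominator is its observed variance. The following is a minimal Python sketch, assuming uncorrelated measurement errors in X and Y. The function name is ours.

```python
def difference_score_reliability(r_xx, r_yy, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical-test-theory reliability of D = X - Y.

    r_xx, r_yy: reliabilities of X and Y
    r_xy: observed correlation between X and Y
    sd_x, sd_y: standard deviations of X and Y
    Assumes measurement errors of X and Y are uncorrelated.
    """
    cov_xy = r_xy * sd_x * sd_y
    true_var = sd_x**2 * r_xx + sd_y**2 * r_yy - 2.0 * cov_xy   # true-score variance of D
    total_var = sd_x**2 + sd_y**2 - 2.0 * cov_xy                # observed variance of D
    return true_var / total_var


if __name__ == "__main__":
    # Same component reliabilities (.80), increasingly correlated components.
    for r_xy in (0.0, 0.3, 0.6):
        print(r_xy, round(difference_score_reliability(0.8, 0.8, r_xy), 3))
```

With both components at a reliability of .80 and equal SDs, the difference score keeps a reliability of .80 when the components are uncorrelated. At r_xy = .60 the reliability falls to .50, even though neither component has become less reliable.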

An Iterative Downsampling Approach

To investigate the effects of sample size and trial number on reliability estimates, we used an iterative downsampling procedure. Two reliability metrics were assessed: (1) Cronbach's alpha, using single-trial accuracy as items, and (2) split-half correlations using all trials. For the downsampling procedure, we randomly sampled subjects and trials from the full dataset. The number of subjects (n) was varied from 5 to 135 in steps of 5. The number of trials (t) was varied from 5 to 540 in steps of 5. The number of subjects and the number of trials were factorially combined (2916 cells total). For each cell in the design, we ran 100 sampling iterations. On each iteration, n subjects and t trials were randomly sampled from the full dataset and reliability metrics were calculated for the sample.
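The following Python sketch shows one plausible reading of this procedure, for the Cronbach's alpha metric only. Several details are our assumptions rather than descriptions of the authors' code: subjects and trials are drawn without replacement, the same trial columns are used for every sampled subject so that trials line up as alpha items, and a draw whose total scores do not vary yields NaN and is skipped. The names `acc`, `downsample_alpha` and `reliability_grid` are hypothetical.

```python
import numpy as np


def cronbach_alpha(items):
    """Unstandardized alpha for a (subjects x items) array; NaN if total scores do not vary."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    total_var = x.sum(axis=1).var(ddof=1)
    if total_var == 0:
        return float("nan")
    return float(k / (k - 1) * (1.0 - x.var(axis=0, ddof=1).sum() / total_var))


def downsample_alpha(acc, n_subjects, n_trials, iterations=100, rng=None):
    """Mean and minimum alpha across random n_subjects x n_trials draws from acc.

    acc: (subjects x trials) array of single-trial accuracy coded 0/1.
    """
    rng = np.random.default_rng() if rng is None else rng
    alphas = np.empty(iterations)
    for i in range(iterations):
        subj = rng.choice(acc.shape[0], size=n_subjects, replace=False)
        cols = rng.choice(acc.shape[1], size=n_trials, replace=False)
        alphas[i] = cronbach_alpha(acc[np.ix_(subj, cols)])
    return float(np.nanmean(alphas)), float(np.nanmin(alphas))


# Grid described in the text: 27 subject counts x 108 trial counts = 2916 cells.
SUBJECT_STEPS = np.arange(5, 136, 5)   # 5, 10, ..., 135
TRIAL_STEPS = np.arange(5, 541, 5)     # 5, 10, ..., 540


def reliability_grid(acc, iterations=100, seed=0):
    """Look-up tables of mean and worst-case alpha; cells larger than acc stay NaN."""
    rng = np.random.default_rng(seed)
    mean_tab = np.full((len(SUBJECT_STEPS), len(TRIAL_STEPS)), np.nan)
    min_tab = np.full_like(mean_tab, np.nan)
    for i, n in enumerate(SUBJECT_STEPS):
        for j, t in enumerate(TRIAL_STEPS):
            if n <= acc.shape[0] and t <= acc.shape[1]:
                mean_tab[i, j], min_tab[i, j] = downsample_alpha(acc, n, t, iterations, rng)
    return mean_tab, min_tab
```

Running the full grid on a 135 × 540 matrix with 100 iterations per cell takes some time in pure NumPy. The nested loops are kept here for clarity rather than speed. The split-half metric would follow the same pattern, with `cronbach_alpha` replaced by an odd/even K correlation.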

Figure 1 shows the results of the downsampling procedure for Cronbach's alpha. Figure 2 shows the results of the downsampling procedure for split-half reliability estimates. In each plot, we show both the average reliability obtained across the 100 iterations (Fig. 1A and Fig. 2A) and the worst reliability obtained across the 100 iterations (Fig. 1B and Fig. 2B). Conceptually, we can think of each iteration of the downsampling procedure as akin to running one “experiment” with subjects randomly sampled from our “population” of 137. While it is good to know the average expected reliability across many experiments, the typical experimenter will run an experiment only once. Thus, considering the “worst case scenario” is instructive for planning the number of subjects and the number of trials to be collected. For a more complete picture of the breadth of reliabilities obtained, we can also consider the variability in reliability across iterations (SD) and the range of reliability values (Fig. 2C-2D). Finally, we repeated this iterative downsampling approach for each individual set-size. Average reliability as well as the variability of reliability for individual set-sizes are shown in Figure 3. Note that the analyses for each set-size begin with one-third as many trials as those in Figures 1 and 2.

Next, we looked at some potential characteristics of samples with low reliability (i.e., iterations with particularly low versus high reliability). We ran 500 sampling iterations of 30 subjects and 120 trials, then performed a median split into high- versus low-reliability samples. There was no significant difference in the mean (p = .86), skewness (p = .60) or kurtosis (p = .70) of high- versus low-reliability samples. There was, however, a significant effect of sample range and variability. As would be expected, samples with higher reliability had a larger standard deviation, t(498) = 26.7, p < .001, 95% CI [.14, .17], and a wider range, t(498) = 15.2, p < .001, 95% CI [.52, .67], than samples with low reliability.
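The median-split comparison can be sketched in a few lines of Python with SciPy. This sketch assumes an independent-samples Student's t-test, which is consistent with the reported df of 498 for 500 samples. It also assumes that the split is made by rank, so each half holds exactly 250 samples. The helper name and the synthetic inputs are hypothetical.

```python
import numpy as np
from scipy import stats


def median_split_compare(reliability, feature):
    """Split samples at the median reliability and t-test a per-sample feature.

    reliability: one reliability estimate per downsampled sample
    feature: a property of the same samples (e.g., SD or range of K scores)
    Returns (t statistic, p value, degrees of freedom).
    """
    reliability = np.asarray(reliability, dtype=float)
    feature = np.asarray(feature, dtype=float)
    order = np.argsort(reliability)            # rank samples by reliability
    half = len(order) // 2
    low, high = feature[order[:half]], feature[order[half:]]
    result = stats.ttest_ind(high, low)        # Student's t, equal variances assumed
    return float(result.statistic), float(result.pvalue), len(high) + len(low) - 2
```

A positive t statistic means that the high-reliability half has the larger feature values, as reported above for the sample SD and range.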



Figure 1. Cronbach's alpha as a function of the number of trials and the number of subjects in Experiment 1. In each cell, Cronbach's alpha was computed for t trials (x-axis) and n subjects (y-axis). (a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials).


Figure 2. Spearman-Brown corrected split-half reliability estimates as a function of the number of trials and subjects in Experiment 1. (a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials). (c) Standard deviation of the reliability obtained across samples. (d) Range of reliability values obtained across samples.


Figure 3. Spearman-Brown corrected split-half reliability estimates for each set-size in Experiment 1. Top 3 panels: Average reliability for each set-size. Bottom 3 panels: Standard deviation of the reliability for each set-size across 100 downsampling iterations.


A Note on Fixed Capacity + Attention Estimates of Capacity

So far, we have discussed only the most commonly used methods of estimating working memory capacity (K scores and percent correct). Other methods of estimating capacity have been used, and we would like to briefly mention one of them. Rouder and colleagues (2008) suggested adding an attentional lapse parameter to estimates of visual working memory capacity, a model referred to as Fixed Capacity + Attention. Adding an attentional lapse parameter accounts for trials on which subjects are inattentive to the task at hand. Specifically, participants commonly make errors on trials that should be well within capacity limits (e.g., set-size 1), and adding a lapse parameter can help to explain these anomalous dips in performance. Unlike typical estimates of capacity, in which a K value is computed directly from performance for each set-size and then averaged, this model uses maximum-likelihood estimation to estimate a single capacity parameter by simultaneously considering performance across all set-sizes and/or change probability conditions. Critically, this model assumes that data are obtained for at least one sub-capacity set-size, and that any error made on this set-size reflects an attentional lapse. If the model is fit to data that lack at least one sub-capacity set-size (e.g., 1 or 2 items), then the model will fit poorly and provide nonsensical parameter estimates.

Recently, Van Snellenberg and colleagues used the Fixed Capacity + Attention model to calculate capacity for a change detection task. They found that the reliability of the model's capacity parameter was low (r = .35) and that the parameter did not correlate with other working memory tasks (Van Snellenberg, Conway, Spicer, Read, & Smith, 2014). Critically, however, this study used only relatively high set-sizes (4 and 8) and lacked a sub-capacity set-size, so model fits were likely poor. Using code made available by Rouder et al., we fit a Fixed Capacity + Attention model to our data (Rouder, n.d.). We found that when this model is misapplied (i.e., used on data without at least one sub-capacity set-size), the internal reliability of the capacity parameter was low (r uncorrected = .35), and the parameter was negatively correlated with raw change detection accuracy, r = -.25, p = .004. If we had applied only this model to our data, we would have mistakenly concluded that change detection measures offer poor reliability and do not correlate with other measures of working memory capacity.
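A few lines of algebra show why the lapse parameter cannot be separated from capacity without a sub-capacity set-size. Consider a single-probe fixed-capacity + attention model of the kind described above, with d = min(k/N, 1), H = a(d + (1 − d)g) + (1 − a)g and FA = a(1 − d)g + (1 − a)g. This parameterization is our assumption; Rouder et al.'s own code may differ in details such as separate guessing rates per change-probability condition. Under this form, H − FA = a·d and FA = g(1 − a·d). Whenever every set-size exceeds k, d = k/N, so a and k enter the predictions only through their product a·k. The following Python sketch (hypothetical function name) shows that two different (k, a) pairs then predict identical data. Adding a set-size below capacity separates them.

```python
import numpy as np


def fca_predictions(k, a, g, set_sizes):
    """Predicted hit and false-alarm rates for a fixed-capacity + attention model.

    k: capacity in items; a: probability of attending on a trial;
    g: probability of guessing "change" when the probed item is not in memory.
    Assumed single-probe form:
      d  = min(k / N, 1)
      H  = a * (d + (1 - d) * g) + (1 - a) * g
      FA = a * (1 - d) * g + (1 - a) * g
    """
    n = np.asarray(set_sizes, dtype=float)
    d = np.minimum(k / n, 1.0)                  # chance the probed item was encoded
    hits = a * (d + (1.0 - d) * g) + (1.0 - a) * g
    false_alarms = a * (1.0 - d) * g + (1.0 - a) * g
    return hits, false_alarms


if __name__ == "__main__":
    # Two different (k, a) pairs with the same product a * k = 3.
    for sizes in ([4, 6, 8], [2, 4, 6, 8]):
        h1, f1 = fca_predictions(4.0, 0.75, 0.5, sizes)
        h2, f2 = fca_predictions(3.0, 1.00, 0.5, sizes)
        same = np.allclose(h1, h2) and np.allclose(f1, f2)
        print(sizes, "indistinguishable" if same else "distinguishable")
```

With set-sizes 4, 6 and 8 only, any likelihood built on these predictions has a ridge along a·k = constant, so the fitted capacity is poorly constrained. With set-size 2 included, the k = 3, a = 1.0 pair predicts perfect performance at N = 2, but the k = 4, a = .75 pair does not.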

Discussion

Here, we have shown that when sufficient numbers of trials and subjects are collected, the reliability of change detection capacity estimates is remarkably high (r > .9). On the other hand, a systematic downsampling method revealed that insufficient trials or insufficient subject numbers could dramatically reduce the reliability obtained in a single experiment. If researchers hope to measure the correlation between visual working memory capacity and some other measure, Figures 1 and 2 can serve as an approximate guide to expected reliability. Because we had only a single sample of the largest n (137), we cannot make definitive claims about the reliability of future samples of this size. However, given the stabilization of correlation coefficients with large sample sizes and the extremely high correlation coefficient obtained, we can be relatively confident that the reliability estimate for our full sample (n = 137) would not change substantially in future samples of university students. Further, we can make claims about how the reliability of small, well-defined sub-samples of this “population” can systematically deviate from an empirical upper bound.

The average capacity obtained for this sample was slightly lower than some other values in the literature, which are typically cited as around 3-4 items. This slightly lower average could raise some concern about whether these reliability values generalize to future samples. In the current sample, average K-scores for set-sizes 4 and 8 were K = 2.3 and K = 2.0, respectively. The largest and most comparable sample to the present one is a 495-subject sample in work by Fukuda, Woodman, and Vogel (2015). In that sample, the average K-scores for set-sizes 4 and 8 were K = 2.7 and K = 2.4, respectively, and the task design was nearly identical (150 ms encoding time, 1000 ms retention interval, no color repetitions allowed, and set-sizes 4 and 8). The difference of 0.3-0.4 items between these two samples is relatively small, though likely significant. However, for the purposes of estimating reliability, the variance of the distribution is more important than the mean. The variability observed in the present sample (SD = 0.7 for set-size 4, SD = .97 for set-size 8) was very similar to that observed in the Fukuda et al. sample (SD = 0.6 for set-size 4 and SD = 1.2 for set-size 8), though unfortunately the Fukuda et al. study did not report reliability. Because the variability of scores was so similar across these two samples, we expect that our reliability results would generalize to other large samples for which change detection scores have been obtained.

We recommend applying an iterative downsampling approach to other measures where expediency of task administration is valued but reliability is paramount. The stats-savvy reader may note that the Spearman-Brown prophecy formula also allows one to calculate how many observations must be added to improve expected reliability, according to the formula:

$$N = \frac{\rho^{*}_{xx'}\,(1 - \rho_{xx'})}{\rho_{xx'}\,(1 - \rho^{*}_{xx'})}$$

where $\rho^{*}_{xx'}$ is the desired correlation strength, $\rho_{xx'}$ is the observed correlation, and $N$ is the number of times that test length must be multiplied to achieve the desired correlation strength. Critically, however, this formula does not account for the accuracy of the observed correlation. Thus, if one starts from an unreliable correlation coefficient obtained with a small number of subjects and trials, one will obtain an unreliable estimate of the number of observations needed to improve correlation strength. In experiments such as this one, both the number of trials and the number of subjects will drastically change estimates of the number of observations needed to observe correlations of a desired strength.

Consider an example from our iterative downsampling procedure. Imagine that we ran 100 experiments, each with 15 subjects and 150 total trials of change detection. We would then obtain 100 different estimates of the strength of the true split-half correlation. We could apply the Spearman-Brown formula to each of these 100 estimates to calculate the number of trials needed to obtain a desired reliability of r = .8. In doing so, we would find that, on average, we would need around 140 trials to obtain the desired reliability. However, because of the large variability in the observed correlation strength (r = .37 to .97), if we had run only the “best case” experiment (r = .97), we would estimate that we need only 18 trials to obtain our desired reliability of r = .8 with 15 subjects. On the other hand, if we had run the “worst case” experiment (r = .37), then we would estimate that we need 1,030 trials. Both types of estimation error have downsides. While a pessimistic estimate of the number of trials needed (>1000) would certainly ensure adequate reliability, it may come at the cost of time and participants' frustration. Conversely, an overly optimistic estimate of the number of trials needed (<20) would lead to underpowered studies that waste time and funds.
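The prophecy formula N = ρ*(1 − ρ) / (ρ(1 − ρ*)) can be applied to this example with a short Python sketch. The helper names are ours, and the sketch takes the 150-trial experiments and the target of r = .8 described above as its inputs.

```python
def length_multiplier(r_observed, r_desired):
    """Spearman-Brown prophecy: factor by which test length must be multiplied."""
    return r_desired * (1.0 - r_observed) / (r_observed * (1.0 - r_desired))


def trials_needed(current_trials, r_observed, r_desired):
    """Number of trials predicted to reach r_desired, given r_observed at current_trials."""
    return current_trials * length_multiplier(r_observed, r_desired)


if __name__ == "__main__":
    # Best- and worst-case split-half correlations from 150-trial experiments.
    for r in (0.97, 0.37):
        print(f"observed r = {r:.2f}: about {trials_needed(150, r, 0.8):.0f} trials for r = .80")
```

With these inputs the formula gives roughly 18.6 trials for r = .97 and roughly 1,022 trials for r = .37. These values are close to the 18 and 1,030 quoted above, and the small gaps presumably reflect rounding of the observed correlations. A single observed correlation therefore moves the prophecy by almost two orders of magnitude.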

Finally, we investigated an alternative parameterization of capacity based on a model that assumes a fixed capacity and an attention lapse parameter (Rouder et al., 2008). Critically, this model attempts to explain errors for set-sizes that are well within capacity limits (e.g., 1 item). If researchers inappropriately apply this model to change detection data with only large set-sizes, they would erroneously conclude that change detection tasks yield poor reliability and fail to correlate with other estimates of capacity (e.g., Van Snellenberg et al., 2014).

In Experiment 2, we shifted our focus to the stability of change detection estimates. That is, how consistent are estimates of capacity from day to day? We collected an unprecedented number of sessions of change detection performance (31) spanning 60 days. We examined the stability of capacity estimates, defined as the correlation between individuals' capacity estimates from one day to the next. Since capacity is thought to be a stable trait of the individual, we predicted that individual differences in capacity should be reliable across many testing sessions.

Experiment 2

Materials and Methods

Participants

Seventy-nine individuals (22 male, 57 female; mean age = 22.67 years, SD = 2.31) with normal or corrected-to-normal vision participated for monetary compensation. The study was approved by the Ethics Committee of Southwest University.

Stimuli

Some experimental sessions were completed in the lab and others were completed in participants' homes. In the lab, stimuli were presented on monitors with a refresh rate of 75 Hz. At home, stimuli were presented on laptop screens with somewhat variable refresh rates and sizes. In both cases, participants sat approximately 60 cm from the screen, though a chin rest was not used, so all visual angle estimates are approximate. In the lab, monitor size varied slightly across testing rooms (five 18.5” LCD monitors, one 19” LCD monitor), leading to small variations in the size of the colored squares. We therefore report approximate ranges in degrees of visual angle for the lab setup.

All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using the Psychophysics Toolbox. Colored squares (51 pixels; range of 1.28° to 1.46° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 14.4° to 14.8° horizontally and 8.1° to 8.4° vertically. Squares could appear in any of nine distinct colors (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Colors were sampled without replacement for set-size 4 and set-size 6 trials. Each color could be repeated up to 1 time in set-size 8 trials (i.e., colors were sampled from a list of 18 colors, with each of the 9 unique colors appearing twice). Participants were instructed to fixate a small black dot (~.3° visual angle) at the center of the display.

Procedures

Trial procedures for the change detection task were identical to Experiment 1. Participants completed a total of 31 sessions of the change detection task. In each session, participants completed a total of 120 trials (split over 5 blocks), with 40 trials each of set-sizes 4, 6, and 8. Participants were asked to complete the change detection task once a day for 30 consecutive days. They could do the task on their own computers or on the experimenters' computers at any time of day. Participants were instructed to complete the task in a relatively quiet environment and not to do anything else (e.g., talking to others) at the same time. Experimenters reminded the participants to finish the task and collected the data file every day.

Results

Descriptive Statistics

Descriptive statistics for average K values across the 31 sessions are shown in Table 2. Across all sessions, the average capacity was 2.83 (SD = .23). Change in mean capacity over time is shown in Figure 4A. A repeated measures ANOVA revealed a significant difference in capacity across sessions, F(18.76, 1388.38) = 15.04, p < .001, ηp² = .169 (Greenhouse-Geisser values are reported when Mauchly's test of sphericity is violated). Subjects' performance initially improved across sessions, then leveled off. The group-average increase in capacity over time is well described by a two-term exponential model (SSE = .08, RMSE = .06, adjusted R² = .94) of the form $y = 2.776\,e^{bx} - 0.798\,e^{dx}$, where x indexes time and b and d are the fitted rate constants. To test the impression that individuals' improvement slowed over time, we
fit several growth curve models to the data using maximum likelihood estimation ('fitmle.m') with Subject entered as a random factor. We coded time as days from the first session (Session 1 = 0). Model A included only a random intercept; Model B included a random intercept and a random linear effect of time; Model C added a quadratic effect of time; and Model D added a cubic effect of time. As shown in Table 3, the quadratic model provided the best fit to the data. Further testing revealed that both random slopes and random intercepts were needed to best fit the data (Table 4, Models C1-C4). That is, participants started out with different baseline capacity values, and they improved at different rates. However, the covariance matrix for Model C revealed no systematic relationship between initial capacity (intercept) and either the linear effect of time, r = .21, 95% CI [-.10, .49], or the quadratic effect of time, r = -.14, 95% CI [-.48, .24]. This suggests that there was no meaningful relationship between a participant's initial capacity and their rate of improvement. To visualize this point, we performed a quartile split of session 1 performance and then plotted the change in capacity for each group (Figure 4).
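A two-term exponential, y = a·e^{bx} + c·e^{dx}, can be fit to session means by nonlinear least squares. This is the same functional form as the 'exp2' library model in MATLAB's Curve Fitting Toolbox, although the routine that produced the reported fit is not stated here. The Python sketch below uses SciPy's `curve_fit`. The starting values, the time coding and the synthetic data in the usage lines are hypothetical. The RMSE and adjusted R² use n − 4 residual degrees of freedom, which is our assumption about how the reported fit statistics were defined.

```python
import numpy as np
from scipy.optimize import curve_fit


def exp2(x, a, b, c, d):
    """Two-term exponential: y = a*exp(b*x) + c*exp(d*x)."""
    return a * np.exp(b * x) + c * np.exp(d * x)


def fit_exp2(x, y, p0):
    """Least-squares fit of exp2; returns parameters, SSE, RMSE and adjusted R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    params, _ = curve_fit(exp2, x, y, p0=p0, maxfev=20000)
    resid = y - exp2(x, *params)
    sse = float(np.sum(resid**2))
    n, p = len(y), len(params)
    rmse = float(np.sqrt(sse / (n - p)))                  # residual df = n - p
    sst = float(np.sum((y - y.mean())**2))
    adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    return params, sse, rmse, adj_r2


if __name__ == "__main__":
    # Hypothetical time coding: sessions on days 0-29 plus a final session on day 59.
    days = np.concatenate([np.arange(30), [59]]).astype(float)
    rng = np.random.default_rng(0)
    means = exp2(days, 3.0, 0.001, -0.8, -0.15) + 0.05 * rng.standard_normal(days.size)
    params, sse, rmse, adj_r2 = fit_exp2(days, means, p0=(2.9, 0.0, -0.7, -0.1))
    print(params, round(sse, 3), round(rmse, 3), round(adj_r2, 3))
```

Two-term exponentials are sensitive to starting values, and the two terms can swap roles between fits. Starting values should therefore be chosen near a plausible asymptote (a) and initial deficit (c).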


Figure 4. Average capacity (K) across testing sessions. Shaded bars represent standard error of the mean. Note that the axis is spliced between days 30 and 60, as no intervening data points were collected during this time. Left: Average change in performance over time. Right: Average change in performance over time for each quartile of subjects (quartile split performed on data from session 1).



Session   N    Mean   SD     Minimum   Maximum   Kurtosis   Skewness
Day 1     79   2.15   0.85   0.40      4.03      -0.69      0.24
Day 2     79   2.36   0.86   0.07      3.97      -0.24      -0.32
Day 3     79   2.43   0.82   0.80      4.07      -0.62      -0.29
Day 4     78   2.51   0.85   0.40      4.10      -0.31      -0.31
Day 5     79   2.52   0.93   0.57      4.27      -0.55      -0.13
Day 6     79   2.74   0.92   0.53      4.60      -0.39      -0.20
Day 7     79   2.73   0.91   0.67      4.63      -0.88      -0.09
Day 8     79   2.66   0.87   1.03      4.70      -0.66      0.06
Day 9     79   2.81   0.92   0.50      5.07      -0.18      -0.19
Day 10    79   2.86   0.94   0.77      4.70      -0.84      0.01
Day 11    78   2.79   0.94   0.40      4.27      -0.51      -0.55*
Day 12    79   2.83   1.01   -0.10     4.80      -0.38      -0.37
Day 13    78   2.85   0.96   0.37      4.80      -0.57      -0.21
Day 14    79   3.01   0.95   0.93      5.03      -0.46      -0.11
Day 15    78   2.85   0.92   0.37      4.37      0.12       -0.73*
Day 16    79   2.91   0.92   0.23      4.90      -0.05      -0.35
Day 17    79   2.84   0.90   0.87      4.77      -0.51      -0.18
Day 18    79   2.93   1.02   0.53      4.73      -0.40      -0.23
Day 19    79   2.90   0.92   0.87      4.57      -0.69      -0.24
Day 20    79   2.94   0.92   0.47      4.93      -0.03      -0.32
Day 21    79   2.98   0.94   0.80      4.90      -0.08      -0.47
Day 22    79   2.99   0.98   0.83      4.90      -0.65      -0.23
Day 23    79   2.86   1.05   0.23      5.47      -0.17      -0.14
Day 24    78   3.00   0.98   0.97      4.77      -0.74      -0.26
Day 25    79   3.04   0.95   0.67      5.03      -0.41      -0.16
Day 26    79   3.01   0.93   0.43      5.07      -0.28      -0.34
Day 27    79   3.09   1.06   0.43      5.00      -0.51      -0.29
Day 28    79   3.04   0.97   0.33      4.83      -0.22      -0.48
Day 29    79   3.01   1.04   0.77      5.07      -0.38      -0.33
Day 30    79   3.02   1.05   0.33      5.00      -0.48      -0.29
Day 60    79   3.00   1.08   -0.13     5.40      0.29       -0.58*

Table2.DescriptivestatisticsforExperiment2.Descriptivestatisticsareshownseparately422

foreachset-sizeandfortheaverageofthethreeset-sizes.Kurtosisandskewnessvalues423

arebothcenteredaround0.Asterisksdenotecredibledeviationfromnormality(Cramer,424

1997).425


Table 3. Comparison of linear, quadratic, and cubic growth models, all with random intercepts and slopes where applicable.

                   Model A:         Model B:   Model C:    Model D:
                   Intercept only   Linear     Quadratic   Cubic
Intercept          2.83***          2.60***    2.41***     2.29***
Linear slope                        .014***    .037***     .07**
Quadratic slope                                -.0005***   -.002*
Cubic slope                                                2 × 10^-5 (n.s.)
-2LL               4366.2           4084.8     3914.7      4231.6
BIC                4389.6           4131.6     3992.7      4348.6

*** p < .001, ** p < .01, * p < .05

Table 4. Comparison of fixed versus random slopes and intercept.

        Model C1:      Model C2:      Model C3:      Model C4:
        Fixed int.,    Fixed int.,    Random int.,   Random int.,
        fixed slope    random slope   fixed slope    random slope
-2LL    6672.3         4627.7         4009.1         3914.7
BIC     6703.5         4682.3         4048.1         3992.7
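To make the selected model concrete, the sketch below evaluates the fixed-effect part of Model C from Table 3 and picks the model with the lowest BIC. The time coding is not stated in the tables, so treating t as the number of days elapsed with t = 0 at the intercept is an assumption, and the printed coefficients are rounded; the names `MODEL_C`, `predicted_k`, and `vertex` are hypothetical.

```python
# Fixed-effect estimates for Model C, transcribed from Table 3 (rounded as printed).
MODEL_C = {"intercept": 2.41, "linear": 0.037, "quadratic": -0.0005}

# BIC values from Table 3; the lowest BIC marks the preferred model.
BIC = {
    "A: intercept only": 4389.6,
    "B: linear": 4131.6,
    "C: quadratic": 3992.7,
    "D: cubic": 4348.6,
}

def predicted_k(t, coef=MODEL_C):
    """Population-average K at time t under a quadratic growth curve.

    Assumes t is coded so that t = 0 corresponds to the intercept.
    """
    return coef["intercept"] + coef["linear"] * t + coef["quadratic"] * t ** 2

def vertex(coef=MODEL_C):
    """Time at which the quadratic curve reaches its peak, -b / (2c)."""
    return -coef["linear"] / (2 * coef["quadratic"])

best_model = min(BIC, key=BIC.get)  # -> "C: quadratic"
```

Under that assumption the curve peaks near t = 37 at roughly K = 3.09 and then flattens, which agrees qualitatively with the plateau in Table 2. Random intercepts and slopes are ignored here, so the sketch describes only the average trajectory, not any individual subject.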


Within-session reliability

Within-session reliability was assessed using Cronbach’s alpha and split-half correlations. Cronbach’s alpha (using single-trial accuracy as items) yielded an average within-session reliability of α = .76 (SD = .04, Min. = .65, Max. = .83). Similarly, split-half correlations on K-scores calculated from even versus odd trials revealed an average Spearman-Brown corrected reliability of r = .76 (SD = .05, Min. = .62, Max. = .84). As in Experiment 1, using raw error (Cronbach’s alpha) versus bias-adjusted capacity measures (Cowan’s K) did not affect reliability estimates. Within-session reliability increased slightly over time (Figure 5). Cronbach’s alpha values were positively correlated with session number (1–31), r = .82, p < .001, 95% CI [.66, .91], as were split-half correlation values, r = .67, p < .001, 95% CI [.41, .83].
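The two internal-consistency measures described above can be sketched in plain Python. This is an illustration under stated assumptions, not the analysis code used here: each subject’s session is assumed to be a list of (change_present, responded_change) pairs, trial positions serve as the items for alpha (which presumes every subject has the same number of trials), and K is computed with a single set size per call; with several set sizes, one could compute K per set size and average. All function names are hypothetical.

```python
from statistics import mean, pvariance

def cowans_k(trials, set_size):
    """Cowan's K = N * (hit rate - false-alarm rate) for single-probe trials.

    trials: list of (change_present, responded_change) boolean pairs.
    """
    change = [resp for present, resp in trials if present]
    no_change = [resp for present, resp in trials if not present]
    hit_rate = sum(change) / len(change)       # "change" responses on change trials
    fa_rate = sum(no_change) / len(no_change)  # "change" responses on no-change trials
    return set_size * (hit_rate - fa_rate)

def cronbach_alpha(scores):
    """Cronbach's alpha; one row per subject, one 0/1 accuracy column per item."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    # Population variance is used in both places, so the ratio matches
    # what sample variances would give.
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)

def split_half_k(trials_by_subject, set_size):
    """Odd/even split-half reliability of Cowan's K across subjects."""
    odd = [cowans_k(t[0::2], set_size) for t in trials_by_subject]   # trials 1, 3, 5, ...
    even = [cowans_k(t[1::2], set_size) for t in trials_by_subject]  # trials 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))
```

The Spearman-Brown step is needed because each half contains only half of the trials, so the raw odd/even correlation understates the reliability of the full session.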


Figure 5. Change in within-session reliability across sessions in Experiment 2. There was a significant positive relationship between session number (1–31) and internal reliability.


Between-session stability

We first assessed stability over time by computing correlation coefficients for all pairwise combinations of sessions (465 total combinations). Missing sessions were excluded from the correlations, meaning that some pairwise correlations included 78 subjects instead of 79 (see Table 2). All sessions correlated with each other, mean r = .71 (SD = .06, Min. = .48, Max. = .86, all p-values < .001). A heatmap of all pairwise correlations is shown in Figure 6. Even the most temporally distant sessions correlated with each other. The correlation between Day 1 and Day 30 (28 intervening sessions) was r = .53, p < .001, 95% CI [.35, .67]; the correlation between Day 30 and Day 60 (0 intervening sessions) was r = .81, p < .001, 95% CI [.72, .88]; the correlation between Day 1 and Day 60 was r = .59, p < .001, 95% CI [.41, .71]. Finally, we observed that between-session stability increased over time, likely due to increased internal reliability across sessions. To quantify this change in stability over time, we calculated the correlation coefficient for temporally adjacent sessions (e.g., the correlation of session 1 and session 2, of session 2 and session 3, etc.). The average adjacent-session correlation was r = .76 (SD = .05, Min. = .64, Max. = .86), and the strength of adjacent-session correlations was positively correlated with session number, r = .68, p < .001, indicating an increase in stability over time.
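The pairwise and adjacent-session analyses can be sketched as follows, assuming (hypothetically) that `k_by_session` is a list of sessions, each holding one K value per subject in a fixed subject order, with None for a missed session. Subjects missing either session are dropped pairwise, which is how some correlations end up with 78 rather than 79 subjects. The interval uses the Fisher z transform; the text does not name its CI method, so that choice is an assumption, and all function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import NormalDist, mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def session_r(k_by_session, a, b):
    """Correlate sessions a and b, dropping subjects missing either one (None)."""
    pairs = [(x, y) for x, y in zip(k_by_session[a], k_by_session[b])
             if x is not None and y is not None]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys), len(pairs)

def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via Fisher's z transform."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

def all_pair_rs(k_by_session):
    """r for every pair of sessions; 31 sessions give 465 pairs."""
    return {(a, b): session_r(k_by_session, a, b)[0]
            for a, b in combinations(range(len(k_by_session)), 2)}

def adjacent_rs(k_by_session):
    """r for temporally adjacent sessions (1-2, 2-3, ...)."""
    return [session_r(k_by_session, i, i + 1)[0]
            for i in range(len(k_by_session) - 1)]
```

As a check, `fisher_ci(0.53, 79)` gives roughly (0.35, 0.67), which matches the interval reported for Day 1 versus Day 30 and suggests, without confirming, that a Fisher-based interval was used.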



Figure 6. Correlations between sessions. Left: Correlations between all possible pairs of sessions. Color represents the correlation coefficient of the capacity estimates from each possible pairwise combination of the 31 sessions. All correlation values were significant, p < .001. Right: Illustration of the sessions that are most distant in time: Day 1 correlated with Day 30 (28 intervening sessions) and Day 30 correlated with Day 60 (no intervening sessions).


Differences by testing location

We tested for systematic differences in performance, reliability, and stability for sessions completed at home versus in the lab. In total, there were 41 subjects who completed all of their sessions in their own home (“home group”), 27 subjects who completed all of their sessions in the lab (“lab group”), and 11 subjects who completed some sessions at home and some in the lab (“mixed group”).

Across all 31 sessions, subjects in the home group had an average capacity of 2.67 (SD = 1.01), those in the lab group had an average capacity of 3.01 (SD = .83), and those in the mixed group had an average capacity of 2.98 (SD = 1.04). On average, scores for sessions in the home group were slightly lower than scores for sessions in the lab group, t(2101) = -7.98, p < .001, 95% CI [-.42, -.25]. Scores for sessions in the mixed group were higher than for sessions in the home group, t(1606) = 5.0, p < .001, 95% CI [.19, .43], but were not different from the lab group, t(1175) = .44, p = .67, 95% CI [-.09, .14]. Interestingly, however, a paired t-test for the mixed group (n = 11) revealed that the same subjects performed slightly better in the lab (M = 3.08) than at home (M = 2.85), t(10) = 3.15, p = .01, 95% CI [.07, .39].
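The paired comparison for the mixed group can be sketched with the standard library alone. It assumes (hypothetically) one lab average and one home average of K per subject, listed in the same subject order, and `paired_t` is a hypothetical name; the confidence interval would additionally need a t quantile such as `scipy.stats.t.ppf`, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(lab, home):
    """Paired t statistic and degrees of freedom for lab vs. home mean K.

    lab, home: per-subject averages in the same subject order (hypothetical layout).
    """
    diffs = [a - b for a, b in zip(lab, home)]
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se, n - 1
```

Because each subject serves as their own control, stable individual differences in capacity drop out of the difference scores, which may explain why this within-subject test detects a location effect that the between-group comparison with the lab group did not.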

Cronbach’s alpha estimates of within-session reliability were slightly higher for sessions completed at home (Mean α = .76, SD = .05) than for sessions completed in the lab (Mean α = .69, SD = .058), t(60) = 3.75, p < .001, 95% CI [.03, .10]. Likewise, Spearman-Brown corrected correlation coefficients were higher for sessions completed at home (Mean r = .79, SD = .07) than for sessions completed in the lab (Mean r = .67, SD = .14), t(60) = 4.42, p < .001, 95% CI [.07, .18]. However, these differences in reliability may result from (1) unequal sample sizes between the lab and home groups, (2) unequal average capacity between groups, or (3) unequal variability between groups. Once sample size was equated between groups and samples were matched for average capacity, differences in reliability were no longer stable: across iterations of matched samples, the difference in Cronbach’s α ranged from p < .01 to p > .5, and the difference in split-half correlations ranged from p < .01 to p > .25.

Next, we examined differences in stability for sessions completed at home compared to in the lab. On average, test-retest correlations were higher for home sessions (Mean r = .72, SD = .08) than for lab sessions (Mean r = .67, SD = .10), t(928) = 8.01, p < .001, 95% CI [.04, .06]. Again, however, differences in test-retest correlations were not reliable after matching sample size and average capacity: across matched samples, p-values for the difference in correlations ranged from p = .01 to p = .98.

Discussion

With extensive practice over multiple sessions, we observed improvement in overall change detection performance. This improvement was most pronounced over early sessions, after which mean performance stabilized for the remaining sessions. The internal reliability of the first session (Spearman-Brown corrected r = .71, Cronbach’s α = .67) was within the range predicted by the look-up table created in Experiment 1 for 80 subjects and 120 trials (predicted range: r = .61 to .87 and α = .58 to .80, respectively). Both reliability and stability remained high over the span of 60 days. In fact, reliability and stability increased slightly across sessions. An important consideration for any cognitive measure is whether repeated exposure to the task will harm the reliability of the measure. For example, re-exposure to the same logic puzzles will drastically reduce the amount of time needed to solve the puzzles and inflate accuracy. Thus, for such tasks great care must be taken to generate novel test versions to be administered at different dates. Similarly, over-practice could lead to a sharp decrease in the variability of performance (e.g., ceiling effects, floor effects), which would in turn reduce reliability, because reliability depends on variability between subjects. Here, we demonstrated that while capacity estimates increase when subjects are frequently exposed to a change detection task, the reliability of the measure is not compromised by practice effects or ceiling effects.

We also examined whether reliability was harmed for participants who completed the change detection sessions in their own homes compared to the lab. While remote data collection sacrifices some degree of experimental control, the use of at-home tests is becoming more common with the ease of remote data collection through resources like Amazon’s Mechanical Turk (Mason & Suri, 2012). Reliability was not noticeably disrupted by noise arising from small differences in stimulus size between different testing environments. After controlling for number of subjects and capacity, there was no longer a consistent difference in reliability or stability for sessions completed at home compared to in the lab. However, capacity estimates obtained in subjects’ homes were significantly lower than those obtained in the lab. Larger sample sizes are needed to more fully investigate systematic differences in capacity and reliability between testing environments.

General Discussion

In Experiment 1, we developed a novel approach for estimating expected reliability in future experiments. We collected change detection data from a large number of subjects and trials, and then we used an iterative downsampling procedure to investigate the effect of sample size and trial number on reliability. Average reliability across iterations was fairly impervious to the number of subjects; instead, it depended more heavily on the number of trials per subject. On the other hand, the variability of reliability estimates across iterations was highly sensitive to the number of subjects. For example, with only 10 subjects, the average reliability estimate for an experiment with 150 trials was high (α = .75), but the worst iteration (akin to the worst expected experiment out of 100) gave a poor reliability estimate (α = .42). The range between the best and worst reliability estimates decreased dramatically as the number of subjects increased: with 40 subjects, the minimum observed reliability for 150 trials was α = .65.
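The iterative downsampling idea can be sketched as repeatedly drawing random subsets of subjects and trials and recording alpha for each draw. The details here are assumptions rather than a description of the Experiment 1 procedure: subjects and trials are sampled without replacement, the same trial positions are used for every subject within an iteration, and `downsample_alpha` and the 0/1 accuracy-matrix layout are hypothetical.

```python
import math
import random
from statistics import mean, pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha; one row per subject, one 0/1 column per trial."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        return math.nan  # alpha is undefined when all subjects share one total
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def downsample_alpha(scores, n_subjects, n_trials, iterations=100, seed=0):
    """Mean, minimum, and maximum alpha over random subject/trial subsets."""
    rng = random.Random(seed)
    alphas = []
    for _ in range(iterations):
        subjects = rng.sample(scores, n_subjects)             # without replacement
        trials = rng.sample(range(len(scores[0])), n_trials)  # same trials for all
        subset = [[row[j] for j in trials] for row in subjects]
        a = cronbach_alpha(subset)
        if not math.isnan(a):
            alphas.append(a)
    return mean(alphas), min(alphas), max(alphas)
```

Comparing the returned minimum and maximum for a small versus a large `n_subjects` at a fixed `n_trials` is the kind of comparison summarized above (e.g., 10 versus 40 subjects at 150 trials).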

In Experiment 2, we examined the reliability and stability of change detection capacity estimates across an unprecedented number of testing sessions. Subjects completed 31 sessions of single-probe change detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later (Day 60). Average internal reliability for the first session was in the range predicted by the look-up table in Experiment 1. Despite improvements in performance across sessions, individual differences in K remained stable over time (the average test-retest correlation between adjacent sessions was r = .76; the correlation for the two most distant sessions, Day 1 and Day 60, was r = .59). Interestingly, both within-session reliability and between-session stability increased across sessions. Rather than diminishing due to practice, the reliability of WMC estimates increased across many sessions.

The present work has implications for planning studies with novel measures and for justifying the inclusion of existing measures in clinical batteries such as the Research Domain Criteria (RDoC) project (Cuthbert & Kozak, 2013; Rodebaugh et al., 2016). For basic research, an internal reliability of 0.7 is considered a sufficient “rule of thumb” for investigating correlational relationships between measures (Nunnally, 1978). While this level of reliability (or even lower) will allow researchers to detect correlations, it is not sufficient to confidently assess the scores of individuals. For that, reliability in excess of .9 or even .95 is desirable (Nunnally, 1978). Here, we demonstrate how the number of trials can alter the reliability of working memory capacity estimates: with relatively few trials (~150, around 10 minutes of task time), change detection estimates are sufficiently reliable for correlational studies (α ~ .8), but many more trials (~500) are needed to boost reliability to the level needed to assess individuals (α ~ .9). Another important consideration for a diagnostic measure is its reliability across multiple testing sessions. Some tasks lose their diagnostic value once individuals have been exposed to them once or twice. Here we demonstrate that change detection estimates of working memory capacity are stable, even when participants are well-practiced on the task (3,720 trials over 31 sessions).
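A classical, idealized complement to the empirical look-up table is the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910), which projects the reliability of a test lengthened by a factor k as kρ / (1 + (k - 1)ρ). It assumes that added trials behave like the existing ones, so it is a rough planning aid only; the function names below are hypothetical.

```python
import math

def spearman_brown_prophecy(rel, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return length_factor * rel / (1 + (length_factor - 1) * rel)

def trials_for_target(rel, n_trials, target):
    """Smallest trial count whose projected reliability reaches target.

    Solves the prophecy formula for the length factor:
    k = target * (1 - rel) / (rel * (1 - target)).
    """
    factor = target * (1 - rel) / (rel * (1 - target))
    return math.ceil(n_trials * factor)
```

Starting from α = .80 at 150 trials, the formula projects about 338 trials for α = .90, fewer than the roughly 500 trials suggested by the empirical estimates above. Where the two disagree, the empirical downsampling results are likely the better guide, since they reflect the task’s actual trial-to-trial structure.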

One challenge in estimating the “true” reliability of a cognitive task is that reliability depends heavily on sample characteristics. As we have demonstrated, varying the sample size and number of trials can yield very different estimates of the reliability of a perfectly identical task. Other sample characteristics can likewise affect reliability; the most notable of these is sample homogeneity. The sample used here was a large sample of university students, with a fairly wide range in capacities (approximately 0.5–4 items). Samples using only a subset of this capacity range (e.g., clinical patient groups with very low capacity) will tend to show lower internal reliability because of the restricted range of the sub-population. Indeed, in Experiment 1 we found that sampling iterations with poor reliability tended to have lower variability and a smaller range of scores. Thus, carefully recording sample size, mean, standard deviation, and internal reliability in all experiments will be critical for assessing and improving the reliability of standardized tasks used for cognitive research. In the interest of replicability, open source code repositories (e.g., the Experiment Factory) have sought to make standardized versions of common cognitive tasks better categorized, open, and easily available (Sochat et al., 2016). However, one potential weakness of task repositories is a lack of documentation about expected internal reliability. Standardization of tasks can be very useful, but it should not be over-applied. In particular, experiments with different goals should use the test lengths that best suit their experimental questions. We feel that projects such as the Experiment Factory will certainly lead to more replicable science, and including estimates of reliability with task code could help to further this goal.
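The restricted-range argument can be made concrete with the classical test theory definition of reliability as true-score variance divided by observed-score variance. The sketch assumes, as a simplification, that error variance is the same in the full sample and in the restricted subgroup; under that assumption a narrower spread of scores lowers reliability even though the task itself is unchanged. The function name is hypothetical, and the values in the usage note are illustrative, not results from these experiments.

```python
def restricted_reliability(rel, sd_full, sd_restricted):
    """Reliability in a subgroup with a narrower observed-score SD.

    Classical test theory: error variance = (1 - rel) * observed variance.
    Assumes the error variance is unchanged in the restricted subgroup; the
    result can go negative if sd_restricted is implausibly small.
    """
    error_var = (1 - rel) * sd_full ** 2
    return 1 - error_var / sd_restricted ** 2
```

For example, a task with reliability .80 in a sample whose SD of K is 1.0 would drop to about .59 in a subgroup whose SD is 0.7.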

Finally, the results presented here have implications for researchers who are interested in differences between experimental conditions and not individual differences per se. Trial number and sample size will affect the degree of measurement error for each condition used within change detection experiments (e.g., set-sizes, distractor presence, etc.). To detect significant differences between conditions and avoid false positives, it would be desirable to estimate the number of trials needed to ensure adequate internal reliability for each condition of interest within the experiment. Insufficient trial numbers or sample sizes can lead to intolerably low internal reliability, and could spoil an otherwise well-planned experiment.

The results of Experiments 1 and 2 revealed that change detection estimates of visual working memory capacity are both internally reliable and stable across many testing sessions. This finding is consistent with previous studies showing that other measures of working memory capacity are reliable and stable, including complex span measures (Beckmann, Holling, & Kuhn, 2007; Foster et al., 2015; Klein & Fiss, 1999; Waters & Caplan, 1996) and the visuospatial n-back (Hockey & Geffen, 2004). The main analyses from Experiment 1 suggest concrete guidelines for designing studies that require reliable estimates of change detection capacity. When both sample size and trial numbers were high, the reliability of change detection was quite high (α > .9). However, downsampled iterations with insufficient sample sizes or numbers of trials frequently had low internal reliability. Consistent with the notion that working memory capacity is a stable trait of the individual, individual differences in capacity remained stable over many sessions in Experiment 2 despite practice-related performance increases.

The effects of both trial number and sample size are important to consider, and researchers should be cautious about generalizing expected reliability across vastly different sample sizes. For example, in a recent paper by Foster and colleagues (2015), the authors found that cutting the number of complex span trials by two-thirds had only a modest effect on the strength of the correlation between working memory capacity and fluid intelligence. Critically, however, the authors used around 500 subjects, and such a large sample size will act as a buffer against increases in measurement error (i.e., fewer trials per subject). Readers wishing to conduct a new study with a smaller sample size (e.g., 50 subjects) would be ill-advised to dramatically cut trial numbers based on this finding alone; as demonstrated in Experiment 1, cutting trial numbers leads to greater volatility of reliability values for small sample sizes relative to large ones. Given present concerns about power and replicability in psychological research (Open Science Collaboration, 2015), we suggest that rigorous estimation of task reliability, considering both subject and trial numbers, will be useful for planning both new studies and replication efforts.
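One way to see why measurement error matters for correlational designs is Spearman’s (1910) attenuation relation: under classical test theory with uncorrelated errors, the expected observed correlation equals the true-score correlation multiplied by the square root of the product of the two measures’ reliabilities. The sketch below is illustrative; the function names and example values are hypothetical.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the true-score correlation
    and the reliabilities of the two measures."""
    return true_r * math.sqrt(rel_x * rel_y)

def disattenuated_r(obs_r, rel_x, rel_y):
    """Inverse operation: estimate the true-score correlation from an
    observed correlation (Spearman's correction for attenuation)."""
    return obs_r / math.sqrt(rel_x * rel_y)
```

With a true correlation of .50, reliabilities of .90 and .80 give an expected observed r of about .42, while lowering the first reliability to .60 gives about .35. A large sample estimates even this shrunken correlation precisely, which is consistent with the buffering role of sample size described above; a small sample has no such cushion.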



References

Beckmann, B., Holling, H., & Kuhn, J.-T. (2007). Reliability of verbal–numerical working memory tasks. Personality and Individual Differences, 43(4), 703–714. https://doi.org/10.1016/j.paid.2007.01.011

Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 1904-1920, 3(3), 296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x

Buschman, T. J., Siegel, M., Roy, J. E., & Miller, E. K. (2011). Neural substrates of cognitive capacity limitations. Proceedings of the National Academy of Sciences, 108(27), 11252–11255. https://doi.org/10.1073/pnas.1104666108

Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. The Behavioral and Brain Sciences, 24(1), 87–114; discussion 114–185. https://doi.org/10.1017/S0140525X01003922

Cowan, N., Fristoe, N. M., Elliott, E. M., Brunner, R. P., & Saults, J. S. (2006). Scope of attention, control of attention, and intelligence in children and adults. Memory & Cognition, 34(8), 1754–1768. https://doi.org/10.3758/BF03195936

Cramer, D. (1997). Basic statistics for social research: Step-by-step calculations and computer techniques using Minitab. London; New York: Routledge.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555

Cuthbert, B. N., & Kozak, M. J. (2013). Constructing constructs for psychopathology: The NIMH research domain criteria. Journal of Abnormal Psychology, 122(3), 928–937. https://doi.org/10.1037/a0034028


Elmore, L. C., Magnotti, J. F., Katz, J. S., & Wright, A. A. (2012). Change detection by rhesus monkeys (Macaca mulatta) and pigeons (Columba livia). Journal of Comparative Psychology, 126(3), 203–212. https://doi.org/10.1037/a0026356

Engle, R. W., Tuholski, S. W., Laughlin, J. E., & Conway, A. R. (1999). Working memory, short-term memory, and general fluid intelligence: A latent-variable approach. Journal of Experimental Psychology: General, 128(3), 309–331.

Foster, J. L., Shipstead, Z., Harrison, T. L., Hicks, K. L., Redick, T. S., & Engle, R. W. (2015). Shortened complex span tasks can reliably measure working memory capacity. Memory & Cognition, 43(2), 226–236. https://doi.org/10.3758/s13421-014-0461-7

Fukuda, K., Vogel, E., Mayr, U., & Awh, E. (2010). Quantity, not quality: The relationship between fluid intelligence and working memory capacity. Psychonomic Bulletin & Review, 17(5), 673–679. https://doi.org/10.3758/17.5.673

Fukuda, K., Woodman, G. F., & Vogel, E. K. (2015). Individual differences in visual working memory capacity: Contributions of attentional control to storage. In P. Jolicoeur, C. Lefebvre, & J. Martinez-Trujillo (Eds.), Mechanisms of sensory working memory: Attention and Performance XXV (pp. 105–120). Elsevier. Retrieved from http://linkinghub.elsevier.com/retrieve/pii/B9780128013717000090

Gibson, B., Wasserman, E., & Luck, S. J. (2011). Qualitative similarities in the visual short-term memory of pigeons and people. Psychonomic Bulletin & Review, 18(5), 979–984. https://doi.org/10.3758/s13423-011-0132-7

Gold, J. M., Wilk, C. M., McMahon, R. P., Buchanan, R. W., & Luck, S. J. (2003). Working memory for visual features and conjunctions in schizophrenia. Journal of Abnormal Psychology, 112(1), 61–71. https://doi.org/10.1037/0021-843X.112.1.61


Hockey, A., & Geffen, G. (2004). The concurrent validity and test–retest reliability of a visuospatial working memory task. Intelligence, 32(6), 591–605. https://doi.org/10.1016/j.intell.2004.07.009

Johnson, M. K., McMahon, R. P., Robinson, B. M., Harvey, A. N., Hahn, B., Leonard, C. J., … Gold, J. M. (2013). The relationship between working memory capacity and broad measures of cognitive ability in healthy adults and people with schizophrenia. Neuropsychology, 27(2), 220–229. https://doi.org/10.1037/a0032060

Klein, K., & Fiss, W. H. (1999). The reliability and stability of the Turner and Engle working memory task. Behavior Research Methods, Instruments, & Computers: A Journal of the Psychonomic Society, Inc, 31(3), 429–432.

Lee, E.-Y., Cowan, N., Vogel, E. K., Rolan, T., Valle-Inclan, F., & Hackley, S. A. (2010). Visual working memory deficits in patients with Parkinson’s disease are due to both reduced storage capacity and impaired ability to filter out irrelevant information. Brain, 133(9), 2677–2689. https://doi.org/10.1093/brain/awq197

Luria, R., Balaban, H., Awh, E., & Vogel, E. K. (2016). The contralateral delay activity as a neural measure of visual working memory. Neuroscience & Biobehavioral Reviews, 62, 100–108. https://doi.org/10.1016/j.neubiorev.2016.01.003

Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods, 44(1), 1–23. https://doi.org/10.3758/s13428-011-0124-6

Melby-Lervåg, M., & Hulme, C. (2013). Is working memory training effective? A meta-analytic review. Developmental Psychology, 49(2), 270–291. https://doi.org/10.1037/a0028228


Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Pailian, H., & Halberda, J. (2015). The reliability and internal consistency of one-shot and flicker change detection for measuring individual differences in visual working memory capacity. Memory & Cognition, 43(3), 397–420. https://doi.org/10.3758/s13421-014-0492-0

Pashler, H. (1988). Familiarity and visual change detection. Perception & Psychophysics, 44(4), 369–378. https://doi.org/10.3758/BF03210419

Reinhart, R. M. G., Heitz, R. P., Purcell, B. A., Weigand, P. K., Schall, J. D., & Woodman, G. F. (2012). Homologous mechanisms of visuospatial working memory maintenance in macaque and human: Properties and sources. Journal of Neuroscience, 32(22), 7711–7722. https://doi.org/10.1523/JNEUROSCI.0215-12.2012

Rodebaugh, T. L., Scullin, R. B., Langer, J. K., Dixon, D. J., Huppert, J. D., Bernstein, A., … Lenze, E. J. (2016). Unreliability as a threat to understanding psychopathology: The cautionary tale of attentional bias. Journal of Abnormal Psychology. https://doi.org/10.1037/abn0000184

Rouder, J. N. (n.d.). Applications and source code. Retrieved June 22, 2016, from http://pcl.missouri.edu/apps

Rouder, J. N., Morey, R. D., Cowan, N., Zwilling, C. E., Morey, C. C., & Pratte, M. S. (2008). An assessment of fixed-capacity models of visual working memory. Proceedings of the National Academy of Sciences of the United States of America, 105(16), 5975–5979. https://doi.org/10.1073/pnas.0711295105

Rouder, J. N., Morey, R. D., Morey, C. C., & Cowan, N. (2011). How to measure working memory capacity in the change detection paradigm. Psychonomic Bulletin & Review, 18(2), 324–330. https://doi.org/10.3758/s13423-011-0055-3

Shipstead, Z., Redick, T. S., & Engle, R. W. (2012). Is working memory training effective? Psychological Bulletin, 138(4), 628–654. https://doi.org/10.1037/a0027473

Sochat, V. V., Eisenberg, I. W., Enkavi, A. Z., Li, J., Bissett, P. G., & Poldrack, R. A. (2016). The Experiment Factory: Standardizing behavioral experiments. Frontiers in Psychology, 7. https://doi.org/10.3389/fpsyg.2016.00610

Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 1904-1920, 3(3), 271–295. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x

Todd, J. J., & Marois, R. (2004). Capacity limit of visual short-term memory in human posterior parietal cortex. Nature, 428(6984), 751–754. https://doi.org/10.1038/nature02466

Unsworth, N., Fukuda, K., Awh, E., & Vogel, E. K. (2014). Working memory and fluid intelligence: Capacity, attention control, and secondary memory retrieval. Cognitive Psychology, 71, 1–26. https://doi.org/10.1016/j.cogpsych.2014.01.003

Van Snellenberg, J. X., Conway, A. R. A., Spicer, J., Read, C., & Smith, E. E. (2014). Capacity estimates in working memory: Reliability and interrelationships among tasks. Cognitive, Affective, & Behavioral Neuroscience, 14(1), 106–116. https://doi.org/10.3758/s13415-013-0235-x


Vogel, E. K., & Machizawa, M. G. (2004). Neural activity predicts individual differences in visual working memory capacity. Nature, 428(6984), 748–751. https://doi.org/10.1038/nature02447

Waters, G. S., & Caplan, D. (1996). The measurement of verbal working memory capacity and its relation to reading comprehension. The Quarterly Journal of Experimental Psychology. A, Human Experimental Psychology, 49(1), 51–75. https://doi.org/10.1080/713755607

Wood, G., Hartley, G., Furley, P. A., & Wilson, M. R. (2016). Working memory capacity, visual attention and hazard perception in driving. Journal of Applied Research in Memory and Cognition. https://doi.org/10.1016/j.jarmac.2016.04.009
