8/16/2019 (766992060) Benefits of Testing Memory; Best Practices and Boundary Conditions.

1 Benefits of testing memory

Best practices and boundary conditions

Henry L. Roediger, III, Pooja K. Agarwal, Sean H. K. Kang and Elizabeth J. Marsh

The idea of a memory test or of a test of academic achievement is often circumscribed. Tests within the classroom are recognized as important for the assignment of grades, and tests given for academic assessment or achievement have increasingly come to determine the course of children's lives: score well on such tests and you advance, are placed in more challenging classes, and attend better schools. Against this widely acknowledged backdrop of the importance of testing in educational life (not just in the US, but all over the world), it would be difficult to justify the claim that testing is not used enough in educational practice. In fact, such a claim may seem to be ludicrous on the face of it. However, this is just the claim we will make in this chapter: Education in schools would greatly benefit from additional testing, and the need for increased testing probably increases with advancement in the educational system. In addition, students should use self-testing as a study strategy in preparing for their classes.

Now, having begun with an inflammatory claim (we need more testing in education), let us explain what we mean and back up our claims. First, we are not recommending increased use of standardized tests in education, which is usually what people think of when they hear the words "testing in education." Rather, we have in mind the types of assessments (tests, essays, exercises) given in the classroom or assigned for homework. The reason we advocate testing is that it requires students to retrieve information effortfully from memory, and such effortful retrieval turns out to be a wonderfully powerful mnemonic device in many circumstances.

Tests have both indirect and direct effects on learning (Roediger & Karpicke, 2006b). The indirect effect is that, if tests are given more frequently, students study more. Consider a college class in which there is only a midterm and a final exam compared to a similar class in which weekly quizzes are given every Friday, in addition to the midterm and the final. A large research program is not required to determine that students study more in the class with weekly quizzes than in the class without them. Yet tests also have a direct effect on learning; many studies have shown that students' retrieval of information on tests greatly improves their later retention of the tested material, either compared to a no-intervention control or even compared to a control condition in which students study the material for an equivalent amount of time to that given to students taking the test. That is, taking a test on material often yields greater gains than restudying material, as we document below. These findings have important educational implications, ones that teachers and professors have not exploited.

In this chapter, we first report selectively on findings from our lab on the critical importance of testing (or retrieval) for future remembering. Retrieval is a powerful mnemonic enhancer. However, testing does not lead to improvements under all possible conditions, so the remainder of our chapter will discuss qualifications and boundary conditions of test-enhanced learning, as we call our program (McDaniel, Roediger, & McDermott, 2007b).


story. The subjects were told that they should remember the pictures, because they would be tested on the names of the pictures (which were given in the story). The test was free recall, meaning that students were given a blank


affected performance a week later; three prior tests raised recall over 80% relative to the no-test condition (i.e., (32 - 17)/17 × 100). Looked at another way, immediately after study about 32 items could be recalled. If subjects took three tests just after recall, they could still recall 32 items a week later. The act of taking three tests essentially stopped the forgetting process in its tracks, so testing may be a mechanism to permit memories to consolidate or reconsolidate (Dudai, 2006).

Critics, however, could pounce on a potential flaw in the


Figure 1.2 Results from Roediger and Karpicke (2006a, Experiment 1). On the 5-minute delayed test, students who had repeatedly studied the material recalled it better than those who had studied it once and taken a test. Cramming (repeatedly reading) does work, at least at very short retention intervals. However, on the two delayed tests, the pattern reversed; studying and taking an initial test led to better performance on the delayed test than did studying the material twice.

modulate the magnitude of the testing effect, beginning with the format of tests.

THE FORMAT OF TESTS

The power of testing to increase learning and retention has been demonstrated in numerous studies using a diverse range of materials; but both study and test materials come in a multitude of formats. Although the use of true/false and multiple-choice exams is now commonplace in high school and college classrooms, there was a time (in the 1920s and 1930s) when these kinds of exams were a novelty and referred to as "new-type," in contrast to the more traditional essay exams (Ruch, 1929). Given the variety of test formats, one question that arises is whether all formats are equally efficacious in improving retention. If we want to provide evidence-based recommendations for educators to utilize testing as a learning tool, it is important to ascertain if particular types of tests are more effective than others.

In a study designed to examine precisely this issue, Kang, McDermott, and Roediger (2007) manipulated the formats of both the initial and final tests, multiple-choice (MC) or short answer (SA), using a fully-crossed, within-subjects design. Students read four short journal articles, and immediately afterwards they were given an MC quiz, an SA quiz, a list of statements to read, or a filler task. Feedback was given on quiz answers, and the quizzes and the list of statements all targeted the same critical facts. For instance, after reading an article on literacy acquisition, students in the SA condition generated an answer to


were significantly worse in the filler-task condition than the other three conditions, indicating that both testing (with feedback) and focused re-exposure aid retention of the target information. Importantly, only the initial SA condition produced significantly better final performance than the read-statements condition; the initial MC and read-statements conditions did not differ significantly. Retrieval is a potent memory modifier (Bjork, 1975). These results implicate the processes involved in actively producing information from memory as the causal mechanism underlying the testing effect. Regardless of the format of the final test, the initial test format that required more effortful retrieval (i.e., short answer) yielded the best final performance, and this condition was significantly better than having read the test answers in isolation.

Similar results from other studies provide converging evidence that effortful retrieval is crucial for the testing effect (Carpenter & DeLosh, 2006; Glover, 1989). Butler and Roediger (2007), for example, used art history video lectures to simulate classroom learning. After the lectures, students completed short answer or multiple-choice tests, or they read statements as in Kang et al. (2007). On a final SA test given 30 days later, Butler and Roediger found the same pattern of results: (1) retention of target facts was best when students were given an initial SA quiz, and (2) taking an initial MC test produced final performance equivalent to reading the test answers (without taking a test). As discussed in a later section of this chapter, these findings have been replicated in an actual college course (McDaniel, Anderson, Derbish, & Morrisette, 2007b).

Although most evidence suggests that tests that require effortful retrieval yield the most memorial benefits, it should be noted that this depends upon successful retrieval on the initial test (or the delivery of feedback). Kang et al. (2007) had another experiment identical to the one described earlier except that no feedback was provided on the initial tests.


subsequent tests. In a similar vein, it has been shown that items that elicit errors on a cued recall test have almost no chance of being recalled correctly at a later time unless feedback is given (Pashler, Cepeda,


Table 1.1 Mean proportion recalled in Agarwal et al.'s (2008) Experiment 2 on test formats. Proportion correct was greater for test conditions than study conditions; however, subjects predicted learning would be greater for study conditions relative to test conditions. Learning conditions that contained both testing and feedback, namely the closed-book test with feedback and the open-book test conditions, contributed to best performance on the delayed test.

                                  Proportion correct
Condition                         Initial test   One-week delayed test

Study 1×                                         .40
Study 2×                                         .50
Study 3×                                         .54
Closed-book test                  .67            .55
Closed-book test with feedback    .65            .66
Open-book test                    .8             .66
Simultaneous answering            .83            .5
Non-studied control                              .16

open-book test. Perhaps a difference between these two conditions will emerge with a longer delay for the final test, an outcome that has occurred in other experiments (e.g., Roediger & Karpicke, 2006a). Such a finding would probably further depend on how students approach the open-book tests (e.g., whether they attempt retrieval of an answer before consulting the study material for feedback, or whether they immediately search the study material in order to identify the target information). Future research in our lab will tackle this topic.

In summary, one reason why testing benefits memory is that it promotes active retrieval of information. Not all formats of tests are equal. Test formats that require more effortful retrieval (e.g., short answer) tend to produce a greater boost to learning and retention, compared to test formats that engage less effortful retrieval (e.g., multiple-choice). However, tests that are more effortful or challenging also increase the likelihood of retrieval failure, which has been shown to reduce the beneficial effect of testing. Therefore, to ameliorate low performance on the initial test, corrective feedback should be provided. The practical implications of these findings for improving learning in the classroom are straightforward: instead of giving students summary notes to read, teachers should implement more frequent testing (of important facts and concepts), using test formats that entail effortful retrieval, and provide feedback to correct errors.


TESTING AND FEEDBACK

Broadly defined, feedback is information provided following a response or recollection, which informs the learner about the status of current performance, often leading to improvement in future performance (Roediger, Zaromb, & Butler, 2008). A great deal of laboratory and applied research has examined the conditions under which feedback is, and is not, effective in improving learning and performance.


& Marsh, 200). Similarly, Butler, Karpicke, and Roediger (2008) suggested that feedback reinforces the association between a cue and its target response, increasing the likelihood that an initially low-confidence correct response will be produced on a final criterial test.

Regarding the best time to deliver feedback, there exists a great deal of debate and confusion, as immediate and delayed feedback are operationalized differently across studies. For instance, the term "immediate feedback" has been used to imply feedback given just after each test item or feedback provided immediately after a test. On the other hand, delayed feedback can take place anywhere from 8 seconds after an item to 2 days after the test (Kulik & Kulik, 1988), although in many educational settings the feedback may actually occur a week or more later.

Butler et al. (2007) investigated the effects of type and timing of feedback on long-term retention. Subjects read prose passages and completed an initial multiple-choice test. For some responses, they received standard feedback (i.e., the correct answer); for others they received feedback by answering until correct (labeled AUC: i.e., each response was labeled as correct or incorrect, and if incorrect they chose additional options until they answered the question correctly). Half of the subjects received the feedback immediately after each question, while the rest received the feedback after a 1-day delay.

One week later, subjects completed a final cued recall test, and these data are shown in Figure 1.4. On the final test, delayed feedback led to substantially better performance than immediate feedback, while the standard feedback condition and the answer-until-correct feedback condition resulted in similar performance. Butler et al. discussed why some studies find immediate feedback to be more beneficial than delayed feedback (e.g., see Kulik & Kulik, 1988, for a review). Namely, this inconsistency might occur if learners do not fully process delayed feedback, which would be particularly likely in applied studies where less experimental control is present. That is, even if students receive delayed feedback they may not look at it or look only at feedback on questions that they missed.


Figure 1.4 Results from Butler, Karpicke, and Roediger (2007). On the final test, delayed feedback (FB) led to substantially better performance than immediate feedback, while the standard feedback condition and the answer-until-correct (AUC) feedback condition resulted in similar performance.


concentrate on others that have not yet been learned. The assumption is that learning to a criterion of one correct recitation means that the item is learned and that further practice on it will be for naught.

Karpicke and Roediger (2008) published a study that questions this common wisdom. In their study, students learned foreign language vocabulary in the form of Swahili-English word pairs. (Swahili was used because students were unfamiliar with the language, and yet the word forms were easily pronounceable for English speakers, such as mashua-boat.) Students learned 40 pairs under one of four conditions. In one condition, students studied and were tested on the 40 pairs in the usual multitrial learning situation favored by psychologists (study-test, study-test, study-test, study-test, labeled the ST condition). In a second condition, students received a similar first study-test cycle, but if they correctly recalled pairs on the test, these pairs were dropped from the next study trial. Thus, across the four trials, the study list got smaller and smaller as students recalled more items. However, in this condition, labeled SNT, students were tested on all 40 pairs during each test period. Thus, relative to the ST condition, the SNT condition involved fewer study opportunities but the same number of tests. In a third condition, labeled STN, students studied and were tested the same way as in the other conditions on the first trial, but after the first trial they repeatedly studied the pairs three more times, but once they had recalled a pair, it was dropped from the test. In this condition, the study sequence stayed the same on four occasions, but the number of items tested became smaller and smaller. Finally, in a fourth condition denoted SNTN, after the first study-test trial, items that were recalled were dropped both from the study and test phase of the experiment for the additional trials. In this case, then, the study list and the test sequence became shorter over trials. This last condition is most like standard advice for using flashcards: students studied and were tested on the pairs until they were recalled, and then they were dropped so that attention could be devoted to unlearned pairs.

Initial learning on the 40 pairs in the four conditions is shown in Figure 1.5, where it can be seen that all four conditions produced equivalent learning. The data in Figure 1.5 show cumulative performance, such that students were given credit the first time they recalled a pair and not again (for those conditions in which multiple recalls of a pair were required, ST and SNT). At the end of the learning phase, students were told that they would come back a week later to be tested again and were asked to predict how many pairs they would recall. Students in all four groups estimated that they would recall about 20 pairs, or 50% of the pairs, a week later. After all, the learning curves were equal, so why would we expect students' judgments to differ?

Figure 1.6 shows the proportion of items recalled 1 week later, in each of the four conditions. Students in two conditions did very well (ST and SNT, around 80%) and in the other two conditions students did much more poorly (STN and SNTN, around 35%).
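The dropout rules that distinguish the four conditions can be made concrete with a short sketch. This is only an illustration of the design as described above, not the authors' materials or procedure: the function name `next_trial_lists`, the flag names, and the inline labels (with the subscript N written as a plain letter) are assumptions introduced here.

```python
# Illustrative sketch only: names and structure are hypothetical,
# chosen to mirror the verbal description of the four conditions.
CONDITIONS = {
    # label: (drop recalled pairs from study?, drop recalled pairs from test?)
    "ST": (False, False),    # study all pairs, test all pairs
    "SNT": (True, False),    # recalled pairs leave the study list only
    "STN": (False, True),    # recalled pairs leave the test list only
    "SNTN": (True, True),    # recalled pairs leave both lists (flashcard advice)
}


def next_trial_lists(condition, all_pairs, recalled_so_far):
    """Return (study_list, test_list) for the next study-test trial.

    all_pairs: ordered list of cue words.
    recalled_so_far: set of cues recalled at least once on earlier tests.
    """
    drop_from_study, drop_from_test = CONDITIONS[condition]
    remaining = [p for p in all_pairs if p not in recalled_so_far]
    study_list = remaining if drop_from_study else list(all_pairs)
    test_list = remaining if drop_from_test else list(all_pairs)
    return study_list, test_list


if __name__ == "__main__":
    pairs = ["mashua", "pair2", "pair3"]   # placeholder cues, not the real list
    recalled = {"mashua"}                  # suppose one pair was recalled on trial 1
    for label in CONDITIONS:
        study, test = next_trial_lists(label, pairs, recalled)
        print(label, "study:", study, "test:", test)
```

Written this way, it is easy to see that ST and SNT differ only in the study list, while the test list (the source of repeated retrieval) is identical in the two conditions that showed good retention a week later.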


Figure 1.5 Initial cumulative learning performance from Karpicke and Roediger (2008). All four conditions produced equivalent learning.

In both the ST and SNT conditions, students were tested on all 40 pairs for all four trials. Note that in the ST condition students studied all 40 items four times, whereas in the SNT condition, the items were dropped from study. However, this reduced amount of study did not matter a bit for retention a week later. Students in the STN and SNTN conditions had only enough testing for each item to be recalled once and, without repeated retrieval, final recall was relatively poor. Once again, the condition with many more study opportunities (STN) did not lead to any appreciably better recall a week later than the condition that had minimal study opportunities (SNTN).

The bottom line from the Karpicke and Roediger (2008) experiment is that after students have retrieved a pair correctly once, repeated retrieval is the key to improved long-term retention. Repeated studying after this point does not much matter.

Recall that the students in the four conditions predicted that they would do equally well, and recall about 50% after a week. As can be seen in Figure 1.6, the students who were repeatedly tested actually outperformed their predictions, so they underestimated the power of testing. On the other hand, the students who did not have repeated testing overestimated how well they would do. In a later section, we return to the issue of what students know about the effects of testing and whether they use testing as a study strategy when left to their own devices.

Figure 1.6 Final learning results from Karpicke and Roediger (2008). Students in the ST and SNT conditions performed very well, and students in the STN and SNTN conditions did much more poorly. In both the ST and SNT conditions, students were tested on all 40 pairs for all four trials. Students in the STN and SNTN conditions had only enough testing for each item to be recalled once and, without repeated retrieval, recall was relatively poor. The condition with many more study opportunities (STN) did not lead to any appreciably better recall a week later than the condition that had minimal study opportunities (SNTN).

Repeated testing seems to be great for consolidating information into long-term memory, but is there an optimal schedule for repeated testing? Landauer and Bjork (1978) argued that a condition they called "expanding retrieval" was optimal, or at least was better than two other schedules called "massed practice" and "equal interval practice." To explain, let us stick with our foreign-language vocabulary learning example above, mashua-boat, and consider patterns in which three retrievals of the target might be carried out. In the immediate test condition, after the item has been presented, mashua would be presented three times in a row for boat to be recalled each time. This condition might provide good practice because, of course, performance on each test would be nearly perfect. (In fact, in most experiments, this massed testing condition leads to 98% or higher correct recall with paired-associates.) The massed retrieval condition will be denoted a 0-0-0 to indicate that the three retrievals occurred back to back, with no other study or test items between retrievals of the target.

A second condition is the equal interval schedule, in which tests are given after a delay from study and at equal intervals after that. So, in a 5-5-5 schedule, a pair like mashua-boat would be presented, then five other pairs or tests would occur, and then mashua-?? would be given as a cue. This same process would occur two more times. Although distributed retrieval could be beneficial relative to massed retrieval, just as distributed study practice is beneficial relative to massed practice, one looming problem occurs in the case of retrieval: if the first test is delayed, recall on the first test may be low and, as discussed above, low performance on a test can reduce or eliminate the power of the testing effect. To overcome this problem, Landauer and Bjork (1978) introduced the idea of expanding retrieval practice, to insure a nearly errorless early retrieval with a quick first test while at the same time gaining advantages of distributed testing or practice. So, to continue with our example, in an expanding schedule of 1-4-10, students would be tested with mashua-?? after only 1 intervening item, then again after 4 intervening items, and then after 10 intervening items. The idea behind the expanding schedule is familiar to psychologists because it resembles the idea of shaping behavior by successive approximations (Skinner, 1953, Chapter 6); just as schedules of reinforcement (Ferster & Skinner, 1957) exist to shape behavioral responses, so schedules of retrieval may shape the ability to remember. If a student wants to be able to retrieve a vocabulary word long after study, the expanding retrieval schedule may help to shape its retrieval.

Landauer and Bjork (1978) conducted two experiments pitting massed, equal interval, and expanding interval schedules of retrieval against one another. For the latter conditions, they used 5-5-5 spacing and 1-4-10 spacing. Note that this comparison equates the average interval of spacing at 5. The materials in one experiment were fictitious first and last names, such that students were required to produce a last name when given the first name. Landauer and Bjork measured performance on the three initial tests and then on a final test given at the end of the experimental session. The results for the expanding and equal interval retrieval sequences are shown in Table 1.2 for the three initial tests and then the final, criterial, test. Expanding retrieval schedules were better than equal interval schedules, as Landauer and Bjork predicted, on both the initial three tests and then the final criterial test. The 7% advantage of expanding interval retrieval to equally spaced retrieval on the final test was small but statistically significant, and this is the comparison that the authors emphasized in the paper. They replicated the effect in a separate experiment with face-name pairs. However, note a curious fact about the data in Table 1.2: Over the four tests shown, performance drops steadily in the expanding interval condition (from .6 to .47) whereas in the equal interval condition performance is essentially flat (.42 to .40). This pattern suggests that on a more delayed final test, the curves might cross over and equal interval retrieval might prove superior to expanding retrieval.

Table 1.2 Mean proportion recalled in Landauer and Bjork's (1978) experiment on schedules of testing; data are estimated from Figures 1.1 and 1.2. Expanding retrieval schedules were better than equal interval schedules on both the initial three tests and the final criterial test.

             Initial tests          Final test
             1      2      3

Expanding    .6     .55    .50      .47
Equal        .42    .42    .43      .40
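The schedule notation used above (0-0-0, 5-5-5, 1-4-10) can be unpacked with a few lines of code. The sketch below is an illustration written for this passage, not a procedure taken from Landauer and Bjork; the function names `retrieval_positions` and `mean_gap` are hypothetical, and the study presentation is assumed to sit at position 0 in the list.

```python
# Illustrative sketch: each number in a schedule is the count of
# intervening items before the next test of the same pair.
SCHEDULES = {
    "massed": (0, 0, 0),
    "equal": (5, 5, 5),
    "expanding": (1, 4, 10),
}


def retrieval_positions(gaps):
    """List positions of the tests, with the study presentation at position 0."""
    positions = []
    current = 0
    for gap in gaps:
        current += gap + 1  # skip the intervening items, then take the test
        positions.append(current)
    return positions


def mean_gap(gaps):
    """Average number of intervening items per retrieval."""
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    for name, gaps in SCHEDULES.items():
        print(name, retrieval_positions(gaps), mean_gap(gaps))
```

Under these assumptions the equal and expanding schedules both average 5 intervening items and both place the third test at the same list position (18); they differ only in how early the first retrieval occurs, which is the feature the surrounding discussion turns on.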

Strangely, for some years researchers did not investigate Landauer and Bjork's (1978) intriguing findings, perhaps because they made such good sense. Most of the studies on retrieval schedules compared expanding and massed retrieval, but did not include the critical equal interval condition needed to compare expanding retrieval to another distributed schedule (e.g., Rea & Modigliani, 1985). All studies making the massed versus expanding retrieval comparison showed expanding retrieval to be more effective, and Balota, Duchek, and Logan (2007) have provided an excellent review of this literature. They show conclusively that massed testing is a poor strategy relative to distributed testing, despite the fact that massed testing produces very high performance on the initial tests (much higher than equal interval testing). Although this might seem commonplace to cognitive psychologists steeped in the literature of massed versus spaced presentation (the spacing effect), from a different perspective the outcome is surprising. Skinner (1958) promoted the notion of errorless retrieval as being the key to learning, and he implemented this approach into his teaching machines and programmed learning. However, current research shows that distributed retrieval is much more effective in promoting later performance than is massed retrieval, even though massed retrieval produces errorless performance.

On the other hand, when comparisons are made between expanding and equal interval schedules, the data are much less conclusive. The other main point established in the Balota et al. (2007) review is that no consistent evidence exists for the advantage of expanding retrieval schedules over equal interval testing sequences. A few studies after Landauer and Bjork's (1978) seminal study obtained the effect, but the great majority did not. For example, Cull (2000) reported four experiments in which students learned difficult word pairs. Across experiments, he manipulated variables such as intertest interval, feedback or no feedback after the tests, and testing versus restudying the material. The general conclusion drawn from the four experiments was that distributed retrieval produced much better retention on a final test than did massed retrieval, but that it did not matter whether the schedule had uniform or expanded spacing of tests.

Recent research by Karpicke and Roediger (2007) and Logan and Balota (2008) actually shows a more interesting pattern. On tests that occur a day or more after original learning, equal interval schedules of initial testing actually produce greater long-term retention than do expanding schedules (just the opposite of Landauer and Bjork's findings). Recall the data in Table 1.2 and how the expanding retrieval testing condition showed a steady decline with repeated tests whereas the equal interval schedule showed essentially no decline. Because the final test in these studies occurred during the same session as initial learning, the retention interval for the final test was fairly short, leaving open the possibility that on a long-delayed test the functions would actually cross. This is just what both Karpicke and Roediger and Logan and Balota found.

Karpicke and Roediger (2007) had students learn word pairs taken from practice tests for the Graduate Record Exam (e.g., sobriquet-nickname, benison-blessing) and tests consisted of giving the first member of the pair and asking for the second. Their initial testing conditions were massed (0-0-0), expanding (1-5-9) and equal interval (5-5-5). In addition, they included two conditions in which students received only a single test after either 1 intervening pair or 5. The design and initial test results are shown on the left side of Table 1.3. Initial test performance was best in the massed condition, next in the expanding condition, and worst in the equally spaced condition, the usual pattern, and students in the single test condition recalled less after a delay of 5 intervening items than after 1. There are no surprises in the initial recall data. Half the students took a final test 10 minutes after the initial learning phase, whereas the rest received the final test 2 days later. These results are shown in the right side of Table 1.3. First consider data at the 10-minute delay. The top three rows show a very nice replication of the pattern reported by Landauer and Bjork (1978): Expanding retrieval in the initial phase produced better recall on the final test than the equal interval schedule, and both of these schedules were better than the massed retrieval schedule. Also, the single-immediate test produced better delayed recall than the single-delayed test. However, the startling result in this experiment appeared for those subjects who took the test after a 2-day delay. Recall was now best in the equal interval condition (M = .45) relative to the expanding condition (M = .33), although both conditions still produced better performance than in the massed condition (M = .20). Interestingly, performance also reversed across the delay for the two single test conditions: recall was better in the single-immediate condition after 10 minutes, but was reliably better in the single-delayed condition after 2 days.

Table 1.3 Mean proportion recalled in Karpicke and Roediger's (2007) experiment on schedules of testing. Expanding retrieval in the initial phase produced better recall than the equal interval schedule on the 10-minute delayed test, and both of these schedules were better than the massed retrieval schedule. However, after a 2-day delay, recall was best in the equal interval condition relative to the expanding condition, although both conditions still produced better performance than in the massed condition.

                        Initial tests          Final tests
                        1      2      3        10 min   48 h

Massed (0-0-0)          .98    .98    .98      .47      .20
Expanding (1-5-9)       .78    .76    .77      .7       .33
Equal (5-5-5)           .73    .73    .73      .62      .45
Single-immediate (1)    .8                     .65      .22
Single-delayed (5)      .73                    .57      .30

Karpicke and Roediger (2007) argued that, congenial as the idea is, expanding retrieval is not conducive to good long-term retention. Instead, what seems to be important for long-term retention is the difficulty of the first retrieval attempt.


22/41


half the statements to be right and half to be wrong, and normally false itemsare plausible in order to re0uire rather fine discriminations. $imilarly, f or multiple-choice tests, students receive a 0uestion stem and then four possible

completions, one of which is correct and three others that are err oneous

!but again, statements that might be close to correct%. Because erroneous

information is presented on the tests, students might learn that incorr ect

information, especially if no feedbac is given !as is often the case in collegecourses%. &f the test is especially difficult !meaning a large number of wr ong

answers are selected%, the students may actually leave a test more confused

about the material than when they waled in. 'owever, even if conditions

ar e such that students rarely commit errors, it might be that simply reading

and carefully considering false statements on true?false tests and

distractors on multiple-choice tests can lead later to erroneous nowledge.

Several studies have shown that having people simply read statements (whether true or false) increases later judgments that the statements are true (Bacon, 1979; Begg, Armour, & Kerr, 1985; Hasher, Goldstein, & Toppino, 1977). This effect underlies the tactics of propagandists using the “big lie” technique by repeating a statement over and over until the populace believes it, and is also a favored tactic in most US presidential elections. If an untruth about an opponent is repeated often enough, the statement comes to be believed.

Remmers and Remmers (1926) first discussed the idea that incorrect information on tests might mislead students, when the new techniques of true/false and multiple-choice testing were introduced into education (Ruch, 1929). They called this outcome the negative suggestibility effect, although not much research was done on it for many years. Much later Toppino and his colleagues showed that statements presented as distractors on true/false and multiple-choice tests did indeed accrue truth value from their mere presentation, because these statements were judged as more true when mixed with novel statements in appropriately designed experiments (Toppino & Brochin, 1989; Toppino & Luipersbeck, 1993). In a similar vein, Brown (1988) and Jacoby and Hollingshead (1990) showed that exposing students to misspelled words increased misspelling of those words on a later oral test.

Roediger and Marsh (2005) asked whether giving a multiple-choice test (without feedback) would lead to a kind of misinformation effect (Loftus et al., 1978). That is, if students take a multiple-choice test on a subset of facts, and then take a short answer test on all facts, will prior testing increase intrusions of multiple-choice lures on the final test? Roediger and Marsh conducted a series of experiments to address this question, manipulating the difficulty of the material (and hence level of performance on the multiple-choice test) and the number of distractors given on the multiple-choice test. Three experiments were submitted using these tactics, but the editor asked us to drop our first two experiments (which established the phenomenon) and report only a third, control, experiment that showed that the negative suggestibility effect occurred under tightly controlled but not necessarily realistic conditions.



most interesting experiments (in our opinion) were not reported. Our first experiment was exploratory, just to make sure we could get the effects we sought. Of interest was whether selecting distractors on the multiple-choice test would lead to their intrusion on a later test.



multiple-choice performance, and this allowed us to see if any negative effects of testing were limited to conditions in which subjects made more errors on the multiple-choice test (i.e., difficult items and relatively many distractors).

The interesting data are contained in Tables 1.5 and 1.6 (again, the top panels devoted to Experiment 1). Table 1.5 shows the proportion of short answer questions answered correctly (bowling in response to



Table 1.6 Proportion target incorrect answers on the cued recall test as a function of question difficulty and number of alternatives (including the correct answer) on the prior multiple-choice test. Non-guess responses are in parentheses (proportion correct not including those that received a “not sure” rating). The prior multiple-choice test led to more errors on the final test and this effect grew larger when more distractors had been presented on the multiple-choice test. Removing the “not sure” responses reduced the size of the negative suggestibility effect, but left the basic pattern intact.

                         Number of previous alternatives
                         Zero (not-tested)   Two      Three    Four

Experiment 1
  Easy questions           .08               .0?      .??      .??
                          (.05)             (.06)    (.0?)    (.08)
  Hard questions           .?7               .26      .34      .36
                          (.0?)             (.20)    (.2?)    (.24)

Experiment 2
  Read passages            .05               .0?      .0?      .??
                          (.0?)             (.06)    (.05)    (.07)
  Non-read passages        .??               .24      .25      .37
                          (.03)             (.?3)    (.2?)    (.?5)

The data show clearly that taking a multiple-choice test can simultaneously enhance performance on a later cued recall test (a positive testing effect) and harm performance (a negative suggestibility effect). The former effect comes from questions answered correctly on the multiple-choice test, whereas the latter effect arises from errors committed on the multiple-choice test. In fact, 78% of the multiple-choice lure answers on the final test had been selected erroneously on the prior multiple-choice test. This result is noteworthy because it suggests that any negative effects of multiple-choice testing require selection of an incorrect answer, and that simply reading the lures (and then selecting the correct answer) is not problematic.

In Experiment 2 we examined whether students would show the same effects when learning from prose materials.




The multiple-choice data are displayed in the bottom part of Table 1.4. Again, the results are straightforward: Not surprisingly, performance on unread passages was lower than for read passages, and performance generally declined with the number of distractors (albeit more for unread than read passages). Once again, the manipulations succeeded in varying multiple-choice performance across a fairly wide range.

The consequences of multiple-choice testing can be seen in the bottom of Table 1.5, which shows the proportion of final short answer questions answered correctly. A positive testing effect occurred for both read and unread passages, although for unread passages the effect declined with the number of distractors on the prior multiple-choice test. Still, as in the first experiment, a positive testing effect was observed in all conditions, even when “not sure” responses were removed (the data in parentheses).

As can be seen in Table 1.6, the negative suggestibility effect also appeared in full force in Experiment 2, although it was greater for the nonread passages, with their corresponding higher rate of errors on the multiple-choice test than for the read passages. For the read passages, the error rate nearly doubled after the multiple-choice test, from 5% to 9%.



answers sounds more interesting when one thinks about the importance of endorsing multiple-choice lures for the negative suggestibility effect. One group of undergraduates was warned they would receive a penalty for wrong answers and that they should choose a “don’t know” option if they were not reasonably sure of their answer. Another group was required to answer all of the multiple-choice questions. Both groups showed large positive testing effects, and smaller negative testing effects. Critically, the penalty instruction significantly reduced the negative testing effect, although it was still significant.

Research on negative suggestibility is just beginning, and only a few variables have been systematically investigated. Three classes of variables are likely to be interesting: ones that affect how likely subjects are to select multiple-choice lures (e.g., reading related material, a penalty for wrong answers on the MC test), ones that affect the likelihood that selected multiple-choice lures are integrated with related world knowledge (e.g., corrective feedback), and ones that affect monitoring at test (e.g., the warning against guessing on the final test used in Roediger & Marsh, 2005). The negative testing effect could change in size for any of these reasons. For example, consider one recent investigation involving the effects of adding a “none of the above” option to the MC test (Odegard & Koen, 2007).



constraints (e.g., time pressure), and online monitoring during the learning experience (i.e., subjective assessments of how well the material has been learned; Benjamin, 2007). In other words, a student’s beliefs about learning and memory and his or her subjective evaluations during the learning experience are vital to effective learning (Dunlosky, Hertzog, Kennedy, & Thiede, 2005). In this section we shall discuss the metacognitive factors concomitant with testing, how testing can improve monitoring accuracy, as well as the use of self-testing as a study strategy by students.

Research on metacognition provides a framework for examining how students strategically monitor and regulate their learning. Monitoring refers to a person’s subjective assessment of their cognitive processes, and control refers to the processes that regulate behavior as a consequence of monitoring (Nelson & Narens, 1990). One indirect way in which testing can enhance future learning is by allowing students to better monitor their learning (i.e., discriminate information that has been learned well from that which has not been learned). Enhanced monitoring in turn influences subsequent study behavior, such as having students channel their efforts towards less well-learned materials. A survey of college students’ study habits revealed that students are generally aware of this function of testing (Kornell & Bjork, 2007). In response to the question “If you quiz yourself while you study, why do you do so?” 68% of respondents chose “To figure out how well I have learned the information I’m studying,” while only 18% selected “I learn more that way than through rereading,” suggesting that relatively few students view testing as a learning event (see too Karpicke, Butler, & Roediger, 2009).

To gain insight into subjects’ monitoring abilities, researchers ask them to make judgments of learning (JOLs). Normally during study, students predict their ability to remember the to-be-learned information at a later point in time (usually on a scale of 0–100%), and then these predictions are compared to their actual performance. Usually people are moderately accurate when making these predictions in laboratory paradigms (e.g., Arbuckle & Cuddy, 1969), but JOLs are inferential in nature and can be based on a variety of beliefs and cues (Koriat, 1997).
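As a rough illustration of how predictions can be compared with actual performance, the sketch below scores one learner's JOLs in two ways. It is a minimal example under stated assumptions, not a procedure taken from the studies cited here: the data are hypothetical, the function names are invented for illustration, and the use of a Goodman-Kruskal gamma as the index of item-by-item accuracy is an assumption of this sketch.

```python
from itertools import combinations


def calibration_bias(jols, recalled):
    """Mean predicted recall minus actual recall, both as proportions.

    jols: predictions on the 0-100% scale, one per item.
    recalled: 1 if the item was later recalled, 0 otherwise.
    A positive value indicates overconfidence.
    """
    mean_prediction = sum(jols) / len(jols) / 100
    mean_recall = sum(recalled) / len(recalled)
    return mean_prediction - mean_recall


def gamma_correlation(jols, recalled):
    """Goodman-Kruskal gamma between JOLs and recall outcomes.

    Counts item pairs that prediction and outcome order the same way
    (concordant) versus the opposite way (discordant); ties are ignored.
    Returns None when no untied pairs exist.
    """
    concordant = discordant = 0
    for (j1, r1), (j2, r2) in combinations(zip(jols, recalled), 2):
        product = (j1 - j2) * (r1 - r2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    return None if total == 0 else (concordant - discordant) / total


# Hypothetical data for one learner: six items, JOLs in percent.
jols = [90, 80, 70, 60, 40, 20]
recalled = [1, 1, 0, 1, 0, 0]

print(round(calibration_bias(jols, recalled), 2))   # 0.1
print(round(gamma_correlation(jols, recalled), 2))  # 0.78
```

In this made-up case the learner predicts 60% recall on average but recalls 50%, a bias of .10 toward overconfidence, while the gamma of .78 indicates that higher JOLs mostly went to the items that were in fact recalled.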

The accuracy of one’s metacognitive monitoring depends on the extent to which the beliefs and cues that one uses are diagnostic of future memory performance, and some of students’ beliefs about learning are wrong. For example, subjects believe that items that are easily processed will be easy to retrieve later (e.g., Begg, Duft, Lalonde, Melnick, & Sanvito, 1989), whereas we have already discussed that more effortful retrieval is more likely to promote retention. Similarly, students tend to give higher JOLs after repeated study than after receiving initial tests on the to-be-remembered material, but actual final memory performance exhibits the opposite pattern (i.e., the testing effect; Agarwal et al., 2008; Kang, 2009a; Roediger & Karpicke, 2006a). Repeated studying of the material probably engenders greater processing fluency, which leads to an overestimation of one’s future memory performance.

Students’ incorrect beliefs about memory mean that they often engage in suboptimal learning strategies. For example, JOLs are often negatively correlated with study times during learning, meaning that students spend more time studying items that they feel are difficult and that they still need to master (Son & Metcalfe, 2000; although see Metcalfe & Kornell [2003] for conditions that produce an exception to this generalization). Not only is testing a better strategy, but sometimes substantial increases in study time are not accompanied by equivalent increases in performance, an outcome termed the labor-in-vain effect (Nelson & Leonesio, 1988).

Consider a study by Karpicke (in press) that examined subjects’ strategies for learning Swahili–English word pairs. Critically, the experiment had repeated study–test cycles (multi-trial learning) and once subjects were able to correctly recall the English word (when cued with the Swahili word) they were given the choice of whether to restudy, test, or drop an item for the upcoming trial, with the goal of maximizing performance on a final test 1 week later. Subjects chose to drop the majority of items (60%), while about 25% and 15% of the items were selected for repeated testing and restudy, respectively. Subjects also made JOLs before making each choice, and items selected for restudy were subjectively the most difficult (i.e., lowest JOLs), dropped items were perceived to be the easiest, and items selected for testing were in between. As expected, final performance increased as a function of the proportion of items chosen to be tested, whereas there was no relationship between the proportion of items chosen for restudy and final recall. Finally, there was a negative correlation between the proportion of items dropped and final recall, indicating that subjects dropped items before they had firmly registered the pair.

These results suggest that learners often make suboptimal choices during learning, opting for strategies that do not maximize subsequent retention. Also, the tendency to drop items once they were recalled the first time reflects overconfidence and under-appreciation of the value of practicing retrieval. Follow-up research in our lab (Kang, 2009b) is investigating whether experiencing the testing effect (i.e., performing well on a final test for items previously tested, relative to items that were previously dropped or restudied) can induce learners to preferentially select self-testing study strategies that enhance future recall.

JOLs suggest an important role for testing in improving monitoring accuracy. Delayed JOLs refer to ones solicited at some delay after the items have been studied, whereas immediate JOLs are solicited immediately after each item has been studied. Delayed JOLs are typically more accurate than immediate JOLs (e.g., Nelson & Dunlosky, 1991). This delayed JOL effect is obtained only under certain conditions, specifically when the JOLs are “cue-only” JOLs. This term refers to the situation in which studied items are A–B pairs and subjects are provided only with A when asked to make their prediction for later recall of the target B; the effect does not occur when JOLs are sought with intact cue-target pairs presented (Dunlosky & Nelson, 1992). One explanation for this finding is that subjects attempt retrieval of the target for cue-only delayed JOLs, and success or failure at retrieval then guides subjects’ predictions (i.e., a high JOL is given if the target is successfully retrieved; if not then a low JOL is given). This enhanced ability to distinguish well-learned from less well-learned items, coupled with the testing effect on items retrieved successfully during the delayed JOL, has been proposed to account for the increased accuracy of delayed JOLs (Spellman & Bjork, 1992; Kelemen &

JOLs were not always accurate: Koriat and Bjork (2005) had subjects learn paired associates, including forward-associated pairs (e.g., cheddar–cheese), backwards-associated pairs (e.g., cheese–cheddar), and unrelated pairs. During learning, subjects were asked to judge how likely it was that they would remember the 2nd word in the pair. Subjects over-predicted their ability to remember the target in the backwards-associated pairs, and the authors dubbed this an “illusion of competence.”

On the first study–test cycle, subjects showed the same overconfidence for the backward-associated pairs, but JOLs and recall performance became better calibrated with further study–test opportunities. This finding suggests that prior test experience can enhance learners’ sensitivity to retrieval conditions on a subsequent test, and can be a way to improve metacognitive monitoring.

The studies discussed in this section converge on the general conclusion that the majority of college students are unaware of the mnemonic benefit of testing:



engage in suboptimal study behavior. Even though college students might be expected to be expert learners (given their many years of schooling and experience preparing for exams), they often labor in vain (e.g., rereading the text) instead of employing strategies that contribute to robust learning and retention. Self-testing may be unappealing to many students because of the greater effort required compared to rereading, but this difficulty during learning turns out to be beneficial for long-term performance (Bjork, 1994). Therefore, the challenge for future research is to uncover conditions that encourage learners to set aside their naïve intuitions when studying and opt for retrieval-based strategies that yield lasting results.

APPLICATIONS OF TESTING IN CLASSROOMS

Recently, several journal articles have highlighted the importance of using tests and quizzes to improve learning in real educational situations. The notion of using testing to enhance student learning is not novel, however, as Gates employed this practice with elementary school students in 1917 (see too Jones [1923] and Spitzer [1939]). One cannot, however, assume that laboratory findings necessarily generalize to classroom situations, given that some laboratory parameters (e.g., relatively short retention intervals, tight experimental control) do not correspond well to naturalistic contexts. This distinction has garnered interest recently and we will outline a few studies that have evaluated the efficacy of test-enhanced learning within a classroom context.

Leeming (2002) adopted an “exam-a-day” procedure in two sections of Introductory Psychology and two sections of his summer Learning and Memory course, for a total of 22 to 24 exams over the duration of the courses. In comparable classes taught in prior semesters, students had received only four exams. Final retention was measured after 6 weeks. Leeming found significant increases in performance between the exam-a-day procedure and the four exam procedure in both the Introductory Psychology sections (80% vs. 74%) and Learning and Memory sections (89% vs. 81%). In addition, the percentage of students who failed the course decreased following the exam-a-day procedure. Leeming’s students also participated in a survey, and students in the exam-a-day sections reported increased interest and studying for class.

McDaniel et al. (2007a) described a study in an online Brain and Behavior course that used two kinds of initial test questions, short answer and multiple-choice, as well as a read-only condition.



examinations in multiple-choice format were given after 3 weeks of quizzes, and a final cumulative multiple-choice assessment at the end of the semester covered material from both units. Although facts targeted on the initial quizzes were repeated on the unit/final exams, the question stems were phrased differently so that the learning of concepts was assessed rather than memory for a prior test response. On the two unit exams, retention for quizzed material was significantly greater than that for non-quizzed material, regardless of the initial quiz format. On the final exam, however, only short answer (but not multiple-choice) initial quizzes produced a significant benefit above non-quizzed and read-only material. The results from this study provide further evidence of the strength of the testing effect in classroom settings, as well as replicating prior findings showing that short answer tests produce a greater testing effect than do multiple-choice tests (e.g., Butler & Roediger, 2007; Kang et al., 2007).

McDaniel and Sun (2009) replicated these findings in a more traditional college classroom setting, in which students took two short-answer quizzes per week. The quizzes were emailed to the students, who had to complete them by noon the next day. After emailing their quiz back to the professor, students received an email with the quiz questions and correct answers. Retention was measured on unit exams, composed of quizzed and non-quizzed material, and administered at the end of every week. Performance for quizzed material was significantly greater than performance on non-quizzed material.

Finally, Roediger, McDaniel, McDermott, and Agarwal (2009) conducted various test-enhanced learning experiments at a middle school in Illinois. The experiments were fully integrated into the classroom schedules and used material drawn directly from the school and classroom curriculum. In the first study, 6th grade social studies, 7th grade English, and 8th grade science students completed initial multiple-choice quizzes over half of the classroom material. The other half of the material served as the control material. The teacher in the class left the classroom during administration of quizzes, so she did not know the content of the quizzes and could not bias her instruction toward (or against) the tested material. The initial quizzes included a pre-test before the teacher reviewed the material in class, a post-test immediately following the teacher’s lecture, and a review test a few days after the teacher’s lecture. Upon completion of a 3- to 6-week unit, retention was measured on chapter exams composed of both quizzed and non-quizzed material. At all three grade levels, and in all three content areas, significant testing effects were revealed such that retention for quizzed material was greater than for non-quizzed material, even up to 9 months later (at the end of the school year). The results from 8th grade science, for example, can be seen in Figure 1.7.

This experiment was replicated with 6th grade social studies students who, instead of completing in-class multiple-choice quizzes, participated in games online using an interactive website at their leisure. This design was implemented in order to minimize the amount of class time required for a test-enhanced learning program. Despite being left to their own devices, students still performed better on quizzed material available online than non-quizzed material on their final chapter exams. Furthermore, in a subsequent experiment with 6th grade social studies students, a read-only condition was included, and performance for quizzed material was still significantly greater than read-only and non-quizzed material, even when the number of exposures was equated between the quizzed and read-only condition.

Figure 1.7 Science results from Roediger, McDaniel, McDermott, and Agarwal (2009). Significant testing effects in a middle school setting were revealed such that retention for quizzed material was greater than for non-quizzed material, even up to 9 months later (at the end of the school year).

In sum, recent research is beginning to demonstrate the robust effects of testing in applied settings, including middle school and college classrooms. Future research extending to more content areas (e.g., math), age groups (e.g., elementary school students), methods of quizzing (e.g., computer-based and online), and types of material (e.g., application and transfer questions), we expect, will only provide further support for test-enhanced learning programs.

CONCLUSION

In this chapter, we have reviewed evidence supporting test-enhanced learning in the classroom and as a study strategy (i.e., self-testing) for improving student performance. Frequent classroom testing has both indirect and direct benefits. The indirect benefits are that students study for more time and with greater regularity when tests are frequent, because the specter of a looming test encourages studying. The direct benefit is that testing on material serves as a potent enhancer of retention for this material on future tests, either relative to no activity or even relative to restudying material. Providing correct answer feedback on tests and ensuring that students carefully process this feedback greatly enhances this testing effect. Feedback is especially important when initial test performance is low. Multiple tests produce a larger testing effect than does a single test. In addition, tests requiring production of answers (short answer or essay tests) produce a greater testing effect than do recognition tests (multiple-choice or true/false). The latter tests also have the disadvantage of exposing students to erroneous information, but giving feedback eliminates this problem. Test-enhanced learning is not limited to laboratory materials; it improves performance with educational materials (foreign language vocabulary, science passages) and in actual classroom settings (ranging from middle school classes in social studies, English, and science, to university classes in introductory psychology and biological bases of behavior).

On the analysis of the factors of recall in the learning process. Psychological Monographs, 11, 159–177.

Agarwal, P. K., Karpicke, J. D., Kang, S. H. K., Roediger, H. L., & McDermott, K. B. (2008). Examining the testing effect with open- and closed-book tests. Applied Cognitive Psychology, 22, 861–876.

Arbuckle, T. Y., & Cuddy, L. L. (1969). Discrimination of item strength at time of presentation. Journal of Experimental Psychology, 81, 126–131.

Bacon, F. T. (1979). Credibility of repeated statements: Memory for trivia. Journal of Experimental Psychology: Human Learning and Memory, 5, 241–252.

Balota, D. A., Duchek, J. M., & Logan, J. M. (2007). Is expanded retrieval practice a superior form of spaced retrieval? A critical review of the extant literature. In J. S. Nairne (Ed.), The foundations of remembering: Essays in honor of Henry L. Roediger, III (pp. 83–105). New York: Psychology Press.

Bangert-Drowns, R. L., Kulik, C. C., Kulik, J. A., & Morgan, M. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61, 213–238.

Begg, I., Armour, V., & Kerr, T. (1985). On believing what we remember. Canadian Journal of Behavioral Science, 17, 199–214.

Begg, I., Duft, S., Lalonde, P., Melnick, R., & Sanvito, J. (1989). Memory predictions are based on ease of processing. Journal of Memory and Language, 28, 610–632.

Benjamin, A. S. (2007). Memory is more than just remembering: Strategic control of encoding, accessing memory, and making decisions. In A. S. Benjamin & B. H. Ross (Eds.), The psychology of learning and motivation: Skill and strategy in memory use (Vol. 48, pp. 175–223). London: Academic Press.




Bjork, R. A. (1975). Retrieval as a memory modifier: An interpretation of negative recency and related phenomena. In R. L. Solso (Ed.), Information processing and cognition: The Loyola Symposium (pp. 123–144). New York:

JOL) and the delayed-JOL effect. Memory and Cognition, 20, 374–380.

Fazio, L. K., & Marsh, E. J. (2009). Surprising feedback improves later memory. Psychonomic Bulletin and Review, 16, 88–92.

Feller, M. (1994). Open-book testing and education for the future. Studies in Educational Evaluation, 20, 235–238.

Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. New York: Appleton-Century-Crofts.

Gates, A. I. (1917). Recitation as a factor in memorizing. Archives of Psychology, 6 (40).

Glover, J. A. (1989). The “testing” phenomenon: Not gone but nearly forgotten. Journal of Educational Psychology, 81, 392–399.

Hasher, L., Goldstein, D., & Toppino, T. (1977). Frequency and the conference of referential validity. Journal of Verbal Learning and Verbal Behavior, 16, 107–112.

Jacoby, L. L., & Hollingshead, A. (1990). Reading student essays may be hazardous to your spelling: Effects of reading incorrectly and correctly spelled words. Canadian Journal of Psychology, 44, 345–358.

Jones, H. E. (1923). The effects of examination on the performance of learning. Archives of Psychology, 10, 1–70.

Kang, S. H. K. (2009a). Enhancing visuo-spatial learning: The benefit of retrieval practice. Manuscript under revision.

Kang, S. H. K. (2009b). The influence of text expectancy, test format and test experience on study strategy selection and long-term retention. Unpublished doctoral dissertation, , USA.

Kang, S. H. K., McDermott, K. B., & Roediger, H. L. (2007). Test format and corrective feedback modify the effect of testing on long-term retention. European Journal of Cognitive Psychology, 19, 528–558.

Karpicke, J. D. (in press). Metacognitive control and strategy selection: Deciding to practice retrieval during learning. Journal of Experimental Psychology: General.

Karpicke, J. D., Butler, A. C., & Roediger, H. L. (2009). Metacognitive strategies in student learning: Do students practise retrieval when they study on their own? Memory, 17, 471–479.

Karpicke, J. D., & Roediger, H. L. (2007). Expanding retrieval practice promotes short-term retention, but equally spaced retrieval enhances long-term retention. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 704–719.

Karpicke, J. D., & Roediger, H. L. (2008). The critical importance of retrieval for learning. Science, 319, 966–968.

Kelemen,



Leeming, F. C. (2002). The exam-a-day procedure improves performance in Psychology classes. Teaching of Psychology, 29, 210–212.

Loftus, E. F., Miller, D. G., & Burns, H. J. (1978). Semantic integration of verbal information into a visual memory. Journal of Experimental Psychology: Human Learning and Memory, 4, 19–31.

Logan, J. M., & Balota, D. A. (2008). Expanded vs. equal interval spaced retrieval practice: Exploring different schedules of spacing and retention interval in younger and older adults. Aging, Neuropsychology, and Cognition, 15, 257–280.

McDaniel, M. A., Anderson, J. L., Derbish, M. H., & Morrisette, N. (2007a). Testing the testing effect in the classroom. European Journal of Cognitive Psychology, 19, 494–513.

McDaniel, M. A., Roediger, H. L., III, & McDermott, K. B. (2007b). Generalizing test-enhanced learning from the laboratory to the classroom. Psychonomic Bulletin and Review, 14, 200–206.

McDaniel, M. A., & Sun, J. (2009). The testing effect: Experimental evidence in a college course. Manuscript under revision.

Marsh, E. J., Agarwal, P. K., & Roediger, H. L., III (2009). Memorial consequences of answering SAT II questions. Journal of Experimental Psychology: Applied, 15, 1–11.

Marsh, E. J., Roediger, H. L., III, Bjork, R. A., & Bjork, E. L. (2007). The memorial consequences of multiple-choice testing. Psychonomic Bulletin and Review, 14, 194–199.

Metcalfe, J., & Kornell, N. (2003). The dynamics of learning and allocation of study time to a region of proximal learning. Journal of Experimental Psychology: General, 132, 530–542.

Moreno, R. (2004). Decreasing cognitive load for novice students: Effects of explanatory versus corrective feedback in discovery-based multimedia. Instructional Science, 32, 99–113.

Nelson, T. O., & Dunlosky, J. (1991).

Nelson, T. O., & Narens, L. (1980). Norms of 300 general-information questions: Accuracy of recall, latency of recall, and feeling-of-knowing ratings. Journal of Verbal Learning and Verbal Behavior, 19, 338–368.

Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. In G. H. Bower (Ed.), The psychology of learning and motivation (Vol. 26, pp. 125–173). New York: Academic Press.

Odegard, T. N., & Koen, J. D. (2007). “None of the above” as a correct and incorrect alternative on a multiple-choice test: Implications for the testing effect. Memory, 15, 873–885.

Pashler, H., Cepeda, N. J.,



Remmers, H. H., & Remmers, E. M. (1926). The negative suggestion effect on true-false examination questions. Journal of Educational Psychology, 17, 52–56.

Richland, L. E., Kornell, N., & Kao, L. S. (2009). The pretesting effect: Do unsuccessful retrieval attempts enhance learning? Journal of Experimental Psychology: Applied, 15, 243–257.

Roediger, H. L., & Karpicke, J. D. (2006a). Test enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249–255.

Roediger, H. L., & Karpicke, J. D. (2006b). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1, 181–210.

Roediger, H. L., McDaniel, M. A., McDermott, K. B., & Agarwal, P. K. (2009). Test-enhanced learning in the classroom: The Columbia Middle School project. Manuscript in preparation.

Roediger, H. L., & Marsh, E. J. (2005). The positive and negative consequences of multiple-choice testing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 1155–1159.

Roediger, H. L., Zaromb, F. M., & Butler, A. C. (2008). The role of repeated retrieval in shaping collective memory. In P. Boyer and J. V.
