
RESEARCH REPORT

Stability in the metamemory realism of eyewitness confidence judgments

Sandra Buratti • Carl Martin Allwood •

Marcus Johansson

Received: 22 October 2012 / Accepted: 18 June 2013

© Marta Olivetti Belardinelli and Springer-Verlag Berlin Heidelberg 2013

Abstract The stability over time of eyewitnesses’ confidence judgments

about their reported memory, and of the accuracy

of these judgments, is of interest in forensic

contexts because witnesses are often interviewed many

times. The present study investigated the stability of the

confidence judgments of memory reports of a witnessed

event and of the accuracy of these judgments over three

occasions, each separated by 1 week. Three age groups

were studied: younger children (8–9 years), older children

(10–11 years), and adults (19–31 years). A total of 93

participants viewed a short film clip and were asked to

answer directed two-alternative forced-choice questions

about the film clip and to give a confidence judgment for each answer.

Different questions about details in the film clip were used

on each of the three test occasions. Confidence as such did

not exhibit stability over time on an individual basis.

However, the difference between confidence and propor-

tion correct did exhibit stability across time, in terms of

both over/underconfidence and calibration. With respect to

age, the adults and older children exhibited more stability

than the younger children for calibration. Furthermore,

some support for instability was found with respect to the

difference between the average confidence level for correct

and incorrect answers (slope). Unexpectedly, however, the

younger children’s slope was found to be more stable than

the adults’ slope. Compared with previous research, the present

study’s use of more advanced statistical methods provides

a more nuanced understanding of the stability of confi-

dence judgments in the eyewitness reports of children and

adults.

Keywords Confidence • Confidence accuracy • Realism of confidence • Calibration • Stability • Eyewitness memory • Metamemory

Introduction

The level and accuracy of the confidence, expressed by

eyewitnesses with respect to their testimony, are important

because their reported confidence levels are often used by

jurors and other parties in the forensic context to assess the

reliability of their memory (Boyce et al. 2007; Cutler et al.

1988; Lindsay et al. 1981; Wells and Bradfield 1999; Wells

et al. 1979). However, research has shown that the accuracy

of confidence judgments is often lacking for both event

memory reports (Allwood 2010; Allwood et al. 2005, 2008;

Leippe and Eisenstadt 2007) and line-up identifications

(Brewer 2006; Sporer et al. 1995). Eyewitnesses to a crime

often talk about their memory of the event many times: with

the police, family, and friends, as well as in court (Gabbert

et al. 2003; Lane et al. 2001; Marsh et al. 2005; Paterson and

Kemp 2006), and this may influence their memories. In

general, memory changes over time, but less is known about

the stability of witnesses’ abilities to provide accurate

confidence judgments concerning the correctness of their

memory over time. The issue of the stability of eyewitness

confidence is relevant in forensic settings because it can

help the police and courts better judge the validity of eye-

witness reports. The accuracy of the eyewitnesses’ confi-

dence in their memories is also referred to as the realism of

eyewitness confidence judgments.

S. Buratti (✉) • C. M. Allwood

Department of Psychology, University of Gothenburg,

Box 500, 40530 Göteborg, Sweden

e-mail: [email protected]

M. Johansson

Department of Psychology, Lund University, Lund, Sweden


Cogn Process

DOI 10.1007/s10339-013-0576-y

Only a few studies have investigated the issues of how

stable the participants’ confidence and realism of confi-

dence are over time (Jonsson and Allwood 2003; Meng-

elkamp and Bannert 2010; Stankov and Crawford 1996;

Thompson and Mason 1996). In general, although all four

studies looked at monotonic stability (explained below) by

means of using correlations, the results from these studies

should be compared and generalized with caution since

they differed in design and other aspects between them-

selves and from the present study. For example, the number

of test occasions varied between two (Stankov and Craw-

ford 1996; Thompson and Mason 1996, Experiments 1–2)

and three (Jonsson and Allwood 2003; Mengelkamp and

Bannert 2010, within the same session), and the time delay

between test occasions was 1 or 2 weeks, except in

Mengelkamp and Bannert (2010) where it was 10–20 min.

Moreover, all studies used alternative forms of the same

tests, except Mengelkamp and Bannert (2010) who used

partly the same questions and partly different questions

over the three test occasions. The majority of studies used

adult samples, usually undergraduates, with the exception

of Thompson and Mason, who in Experiment 2 used older

adults. Finally, Thompson and Mason (1996) used a

qualitative scale in unit steps from 0 (guessing) to a top value

labelled certain (3 in Experiment 1, 4 in Experiment 2), and

the other three studies used a numerical confidence scale

expressed in percent. The stability of confidence, as such,

over test occasions was not analyzed by Stankov and

Crawford (1996). However, the other three studies indicate

some stability of confidence over time. For example, the

Pearson correlations between the three test occasions in

Jonsson and Allwood (2003) varied between .71 and .87 for

the two types of material used. Kelemen et al. (2000)

studied the stability of other types of metacognitive

judgments.

Research on confidence stability has generally not

considered that stability is a multifaceted concept. Many

kinds of stability exist, and how to define stability is

debated (Asendorpf 1989; Duncan et al. 2006). For some

researchers, the term ‘‘stability’’ may be understood as no

change at all. Tisak and Meredith (1990) called this type of

stability strict stability, which means no change between

time points for an individual. Another type of stability is

monotonic stability, which implies a linear change with

individuals maintaining rank order within the group. Strict

stability can be considered a special case of monotonic

stability. The present study considered these two types of

stability; to differentiate between them, we used multilevel

modelling (MLM) analysis.

None of the previous studies investigating the realism of

eyewitness confidence over time used MLM to assess the

intra-individual stability of the realism of eyewitness con-

fidence judgments. In the present study, we investigated the

stability of the realism of eyewitnesses’ confidence in their

episodic memory over time using MLM to assess intra-

individual stability. We also investigated the differences in

the degree of stability between three different age groups,

including children and adults.

Measures of realism in metamemory research

Many different measures exist for assessing the realism of

confidence. Different studies use different measures, which

focus on the different aspects of realism of confidence. The

measures are often divided into two categories: absolute

and relative measures (Bornstein and Zickafoose 1999;

Lichtenstein and Fischhoff 1977; Nelson 1996; Schraw

2009). Yates (1990, 1994) has provided a general overview

of measures of the accuracy of confidence judgments.

Absolute measures are measures that assess the rela-

tionship between confidence and the proportion of correct

answers. One such measure is bias (also known as over/

underconfidence), which provides information about pos-

sible overconfidence or underconfidence. Another absolute

measure is calibration, which is the mean squared devia-

tion of confidence from the percent of correct answers in

each confidence class (0, 10, 20 %, etc.), where each

confidence class is weighted by the number of items in that

class (Lichtenstein et al. 1982). Unlike bias, calibration

does not provide information about the deviation of eye-

witness confidence from perfect realism in terms of abso-

lute values. However, because calibration performs a

squaring operation for each confidence class, the deviation

from perfect realism is penalized for each class, and thus,

overconfidence at one end of the confidence scale cannot

be cancelled out by underconfidence at the other end of the

confidence scale, as is the case for bias.
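
The two absolute measures described above can be written compactly as follows. This is a sketch in notation introduced here for illustration (it does not appear in the source): c_i is the confidence for item i expressed as a proportion, a_i is 1 for a correct answer and 0 otherwise, n is the number of items, and for each confidence class t, r_t is the confidence level, p_t the proportion of correct answers, and n_t the number of items.

```latex
\text{bias} = \frac{1}{n}\sum_{i=1}^{n} c_i \;-\; \frac{1}{n}\sum_{i=1}^{n} a_i
\qquad
\text{calibration} = \frac{1}{n}\sum_{t=1}^{T} n_t \,(r_t - p_t)^2
```

A positive bias indicates overconfidence and a negative bias underconfidence. Calibration is zero only when the confidence level of every class equals the proportion correct in that class, which is why deviations in opposite directions cannot cancel.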

Relative measures assess how correct and incorrect

answers are discriminated by the use of different confi-

dence levels. One such measure is resolution, which does

not provide any information about the direction of the

discrimination (Lichtenstein et al. 1982). Not knowing the

direction means that the same degree of resolution may

mean that high-confidence judgments are assigned to cor-

rect answers and low-confidence judgments to incorrect

answers, or that low-confidence judgments are assigned to

correct answers and high-confidence judgments to incorrect

answers. Furthermore, the difference between the average

confidence for correct answers and the average confidence

for incorrect answers is not provided by the resolution

measure; it only states the extent to which the person can

sort correct and incorrect answers into two separate groups

using their confidence judgments. Another relative measure

originating from signal detection theory is the distance-

based metric da (see Benjamin and Diaz 2008). Yet another

relative measure is the Goodman–Kruskal gamma


correlation, which is the proportion of concordance

between X and Y item pairs (see Nelson 1984). A final

relative measure is the Pearson correlation between confi-

dence and the correctness of reported memories (also

called the confidence/accuracy (C/A) correlation). Like the

gamma correlation, this measure shows the direction of the

discrimination but not the absolute size of the separation

between the mean confidence for correct and incorrect

answers.
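
As an illustration of a relative measure, the Goodman–Kruskal gamma can be computed from concordant and discordant item pairs as (C − D) / (C + D). The sketch below is a minimal Python example, not the procedure used in any of the cited studies; the function name, the input format, and the choice to return None when every pair is tied are assumptions made for illustration.

```python
from itertools import combinations


def goodman_kruskal_gamma(confidence, correct):
    """Gamma between confidence ratings and correctness (1 = correct, 0 = incorrect).

    Pairs tied on either variable are ignored. Returns None when no
    untied pair exists (an illustrative convention, not a standard).
    """
    concordant = 0
    discordant = 0
    for (c1, a1), (c2, a2) in combinations(list(zip(confidence, correct)), 2):
        product = (c1 - c2) * (a1 - a2)
        if product > 0:
            concordant += 1  # higher confidence goes with the correct answer
        elif product < 0:
            discordant += 1  # higher confidence goes with the incorrect answer
    if concordant + discordant == 0:
        return None
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical data: confidence perfectly separates correct from incorrect answers.
print(goodman_kruskal_gamma([0.9, 0.8, 0.6, 0.5], [1, 1, 0, 0]))
```

Because only the ordering within pairs matters, gamma reaches its maximum whenever all correct answers receive higher confidence than all incorrect ones, however small the difference in confidence.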

Finally, slope provides a measure of the difference

between the average confidence for correct and incorrect

answers. This measure includes both a relative component

and an absolute component (i.e., information about the

eyewitness’ ability to live up to the normative goal of

maximal separation of the level of the confidence judg-

ments for correct and incorrect answers) (Yates 1990).

Thus, slope may be argued to belong to a third category of

realism measures: mixed measures. However, slope builds

on the eyewitness’ ability to discriminate between correct

and incorrect answers by using their confidence judgments,

and also gives information about the extent and direction of

this separation. For this reason, we will consider slope as a

relative measure. For example, a high positive value for

slope indicates that the level of the eyewitnesses’ confi-

dence judgment is informative about the correctness of

their memory. In contrast, if all confidence judgments are,

for example, at the 75 % confidence level and the witness

has 75 % of the answers correct, the witness shows no bias,

but also no slope. Here, the confidence judgments would

not help personnel in the justice system to discriminate

between correct and incorrect answers. These examples

also show that different measures of realism are useful

because they provide information about different aspects.
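
The 75 % example above can be checked with a short computation. The following Python sketch computes bias, calibration, and slope as they are described in this section; the function name, the dictionary output, and the example data are hypothetical, and confidence is assumed to be coded as a proportion (0.5 to 1.0).

```python
from collections import defaultdict


def realism_measures(confidence, correct):
    """Return bias, calibration, and slope for one participant.

    confidence: confidence ratings as proportions (e.g. 0.5, 0.6, ..., 1.0)
    correct:    1 for a correct answer, 0 for an incorrect answer
    """
    n = len(confidence)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Bias: mean confidence minus proportion correct.
    bias = sum(confidence) / n - sum(correct) / n

    # Calibration: squared deviation of each confidence class from its
    # proportion correct, weighted by the number of items in the class.
    classes = defaultdict(list)
    for c, a in zip(confidence, correct):
        classes[c].append(a)
    calibration = sum(
        len(answers) * (level - sum(answers) / len(answers)) ** 2
        for level, answers in classes.items()
    ) / n

    # Slope: mean confidence for correct minus mean confidence for incorrect.
    conf_correct = [c for c, a in zip(confidence, correct) if a == 1]
    conf_incorrect = [c for c, a in zip(confidence, correct) if a == 0]
    if conf_correct and conf_incorrect:
        slope = (sum(conf_correct) / len(conf_correct)
                 - sum(conf_incorrect) / len(conf_incorrect))
    else:
        slope = None  # undefined when all answers are correct or all incorrect

    return {"bias": bias, "calibration": calibration, "slope": slope}


# Hypothetical witness: 20 answers, all rated 75 %, 15 of them correct.
flat = realism_measures([0.75] * 20, [1] * 15 + [0] * 5)
print(flat)

# Hypothetical witness who is certain when correct and guessing when incorrect.
sharp = realism_measures([1.0] * 15 + [0.5] * 5, [1] * 15 + [0] * 5)
print(sharp)
```

The first witness has no bias and no slope, so the confidence ratings carry no information about which answers are correct. The second witness has a clearly positive slope but is underconfident in the lowest confidence class, which shows up as nonzero bias and calibration.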

The value of various measures of confidence realism has

been debated in eyewitness research. Traditionally, the

Pearson correlation has been used in this research, espe-

cially in the research on lineups, but for nearly 20 years,

many researchers have argued that absolute measures, such

as bias, are more appropriate for eyewitness research

(Brewer 2006; Juslin et al. 1996; Wells et al. 2002, 2006;

see also Allwood 2010). Juslin et al. (1996) and Brewer

(2006) noted that perfect discrimination may demand that

the eyewitness has control of the conditions, such as the

degree of visibility, that the witness is commonly not able

to control. Thus, to ask for perfect discrimination is to ask

for the impossible. In contrast, calibration gives the witness

a chance to take into account, for example, that the visibility,

duration, or angle of observation was poor when the original

event was observed. Although these authors primarily

considered line-up situations, the same argument applies to

event memory in general.

Juslin et al. (1996) noted that, even in cases where

perfect calibration is present, the more the confidence

judgments are located at the ends of the scale, the higher

the correlation. Wells et al. (2006) noted that one consequence

is that higher C/A correlations are achieved when

different viewing conditions or different types of witnesses

are investigated compared to when more similar conditions

or types of witnesses are used. Finally, measures, such as

the C/A correlation, that only provide information about

the amount of variance explained, in addition to the

direction of the association between confidence and cor-

rectness, may be less helpful to personnel in the justice

system than measures that give information about absolute

levels of realism, such as bias, because the latter infor-

mation may be easier to understand.

In brief, different measures provide information on

different aspects of confidence realism. Because we wanted

to analyze stability for both absolute and relative accuracy,

we included bias, calibration, and slope as indicators of

confidence realism. Bias provides an absolute measure that

indicates the direction of realism (over/underconfidence).

The reason for using the calibration measure is that, in

contrast to bias, calibration calculates the squared differ-

ence within each confidence class and thus, for example,

underconfidence at one end of the confidence scale cannot

cancel out overconfidence at the other end of the scale. We

used slope because of the advantages mentioned above and

because it is intuitively easy to understand.

Research on the stability and realism of confidence

judgments

To investigate confidence and confidence realism stability,

the processes on which metacognitive judgments are based

need to be considered. Metacognitive judgments are based

on different cues that stem from all information that is

activated when the judgment is made (Koriat 1993, 1994).

Such cues can be purely information-based, in that the cues

concern a judgment of one’s own knowledge or compe-

tence within a certain area (i.e., the theory we have

regarding how competent we are at, for example, answer-

ing general knowledge questions, Koriat et al. 2008). The

cues can also be experience-based and concern structural

information, such as processing fluency. For example,

answers that come easily to mind are experienced as having

high processing fluency and may be given higher confi-

dence judgments (Kelley and Lindsay 1993; Koriat 1993).

Research has shown that the levels of confidence for

different types of general knowledge tasks are highly cor-

related; thus, a confidence trait appears to exist (Kleitman

2008). Bornstein and Zickafoose (1999) found that peo-

ple’s confidence levels for a general knowledge task

moderately correlated with their confidence levels for an

eyewitness task. These results are in line with Koriat

et al.’s (2008) idea of information-based cues, and


similarly, a general preference for high confidence might

also give rise to an information-based cue. People may

have a feeling (or notion) of competence when it comes to

recalling information and events (information-based cues),

and this feeling may influence their confidence judgments.

Consequently, basing confidence on information-based

cues may, in turn, lead to some stability in confidence

judgments and in the absolute measures of confidence

realism. Finally, the stability of bias and calibration may be

due, to some extent, to the stability of the participants’

confidence judgments as these measures include confi-

dence. As previous research has indicated stability in

confidence, this statistical effect is likely to contribute to

the stability of absolute measures of realism.

Several studies have reported the stability of absolute

measures. Stankov and Crawford (1996) found evidence for

bias stability and somewhat weaker evidence for stability in

calibration, when participants performed Raven’s progres-

sive matrices, a perceptual test, two vocabulary tests, and

two digit span tests. Similar results were found for calibration

and bias for word knowledge and logical spatial

ability (Jonsson and Allwood 2003), and for learning about

operant conditioning (Mengelkamp and Bannert 2010).

Previous research has mostly covered semantic memory

retrieval, but similar processes may be valid for episodic

memory retrieval. For example, a person may believe that

he or she is very good at recalling events and base judg-

ments on this assumption, resulting in high-confidence

judgments in eyewitness tasks. Conversely, a person may

highly doubt his or her memory for events and consistently

express low confidence in eyewitness tasks. Thus, we

expect some stability in absolute measurements of realism

also for the event memory task used in this research.

However, there are also some other differences between

the situations investigated in the present study and in the

previous research on semantic memory. For example, as

described above, the previous studies mostly asked about

similar but new material on each occasion, whereas our

participants were asked different questions on each occasion

but these questions all concerned the same previous short

event. In spite of these differences, we speculated that use of

the type of information-based cues described above would be

present and used also in the context we investigated.

In addition to information-based cues, as noted above,

confidence judgments can also be influenced by structural

information or so-called experience-based cues (Koriat

et al. 2008). Such experience-based cues may differ for

judgments made at different time points after the original

event. Leonesio and Nelson (1990) argued that structural

information could differ at different points in time when

used in metacognitive judgments. Since relative measures

attend to how well confidence discriminates between cor-

rect and incorrect answers, they are likely to be affected by

experience-based cues. The changes in structural infor-

mation over time are thus likely to contribute to instability

in slope and relative measures in general. For example, a

good slope value is dependent on the confidence level for

correct answers and for incorrect answers, and these levels

may, to a large extent, be derived by using fluency and

other experience-based cues that may be influenced by

when in time the confidence judgment is made. Thus, even

though relative measures to some extent are influenced by

information-based cues, the influence of experience-based

cues on relative measures may well be larger since these

measures to a large extent can be expected to be influenced

by cues relating to the correctness or incorrectness of the

specific answers. Thus, the heavy influence of experience-based

cues can be expected to lead to instability in these

measures.

In accordance with this suggestion, several studies have

shown that relative measures evidence no or little stability

and do not exhibit the same degree of stability as absolute

measures. Stankov and Crawford (1996) reported low sta-

bility for the relative measures of resolution and slope,

slope showing somewhat stronger stability than resolution.

However, Jonsson and Allwood (2003) did report some

stability for resolution in their study (they did not investi-

gate slope). Thompson and Mason (1996) reported a lack

of stability for Goodman–Kruskal gamma correlation (used

as a measure of confidence realism) in a general knowledge

task, a word recognition test, and a face-recognition task.

The same lack of stability for Goodman–Kruskal gamma

correlation and Pearson’s r and da was shown by Meng-

elkamp and Bannert (2010). In brief, various measures of

relative accuracy have been used in previous studies and

the results provide evidence of fairly low stability, or

instability, over time for these measures.

In summary, although previous research used somewhat

different designs and differed in other aspects compared

with the present study, we expected stability for absolute

measures and a lack of stability for slope. One reason is

that slope is influenced, to a greater extent than the absolute

measures, by the level of confidence for correct and

incorrect answers. Other than the face-recognition task

used by Thompson and Mason (1996), we know of no

earlier study investigating the intra-individual stability of

realism of confidence for an eyewitness situation.

A better understanding of the degree of stability and

type of stability (strict or monotonic) of absolute and rel-

ative measures of realism in eyewitness confidence judg-

ments can help staff in the justice system better understand

the conditions for diagnosing the degree of realism asso-

ciated with the confidence judgments of an eyewitness. For

example, it is of forensic relevance whether confidence

judgments given at various points in time after the crime

event can be expected to be strictly stable, monotonically


stable, or not stable at all. Our results pertain especially to

situations where the interviewee is asked questions that

differ between interview occasions, but where the ques-

tions all relate to the same short-duration event. Further-

more, given that the research reviewed above has found

some measures of confidence realism to be more stable

than other measures, the degree of realism of a specific

witness might be diagnosed by giving the witness a small

test using these measures of realism. Although not the

primary aim of the present study, our results also have

implications for the issue of the existence of a confidence

trait suggested by Kleitman and Stankov (2007). For

example, a finding of at least monotonic stability for con-

fidence might, depending on the other results, be inter-

preted as supportive of such a trait.

Stability in different age groups

Three age groups (8 to 9 year olds, 10 to 11 year olds, and

adults) were chosen in order to compare stability in relation

to age. Research shows that metacognitive abilities

improve with age (Kuhn and Dean 2004; Schneider 2008;

Schneider and Lockl 2002). Children younger than 8 years

old were not included in the study because the extent to

which they can understand the confidence judgment task as

presented in this research is not completely agreed upon by

researchers (Allwood et al. 2006, 2008; Howie and

Roebers 2007; Schlottmann and Anderson 1994). The 10-

to 11-year-old age group was included in order to study the

rate at which metacognitive stability changes for children.

Comparing different age groups is of interest because

different types of cues likely form the basis of confidence

judgments. Younger children may use more experience-

based cues to make confidence judgments, as the younger a

person is, the less likely they are to have formed a theory

regarding their performance in similar tasks (information-

based cues). On the other hand, adults have had plenty of

time to form a theory regarding their performance in a

number of different tasks, which in turn would lead them to

make more information-based metacognitive judgments.

As children’s confidence judgments are more likely to be

based on experience-based cues, younger children may

be expected to show more instability in both confidence

and absolute accuracy measures, as well as relative accu-

racy measures, such as slope. We know of no previous

study that has investigated differences in the degree of

stability in confidence realism between age groups.

Multilevel modelling

Recently, a number of new statistical techniques have

emerged that can be used to investigate stability and

change over time. MLM is one such technique, allowing

researchers to investigate whether the time variable is

associated with the level of confidence realism (Rauden-

bush 2001). MLM also allows the researchers to make

inferences about the within-person variance (i.e., the intra-

individual variation for a person over time) as well as the

between-person variance (i.e., differences between people).

For example, age group can be used as a predictor for

investigating differences in change patterns in the mea-

sured variable across time.

Kwok et al. (2008) reviewed several reasons why MLM

techniques are preferred over repeated measures ANOVA

when using a repeated measures design. For example,

MLM allows the estimation of individual-level trends over

time. In contrast, repeated measures ANOVA estimates an

average trend for all participants and treats the individual

variance as unexplained error. Thus, individual variance is

not a problem for MLM, but it is considered noise in a

repeated measures ANOVA. When applying a repeated

measures design with data collected at three time points,

MLM provides a possibility of estimating the individual

differences in rates of change.

Aims

The purpose of the present study was to investigate the

level of stability across time in the proportion of correct

memory reports, confidence, and realism as measured by

bias, calibration, and slope for memory reports of a wit-

nessed event. We looked for evidence of intra-individual

stability (strict or monotonic) for any of the dependent

measures across time. If a measure shows strict stability,

a time-change model cannot be fitted, that is, there should

be no systematic change predicted by time. Also, for a

measure that shows strict stability, there should not exist

any non-systematic change due to other known and

unknown variables. If there exists monotonic stability,

then there could be a linear fixed effect of time, that is,

either a linear increase or a decrease over time for all

individuals. However, there should not be a random effect

of time indicating that individuals’ rank order is being

violated. In addition, we investigated whether a difference

exists in the degree of stability between different age

groups.
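
The two stability criteria can be related to a linear growth model. The formulation below is one common way to write such a model and is offered only as an illustrative sketch, not as the specification fitted in the study; the symbols are assumptions introduced here. Y_{ti} is a dependent measure for person i at occasion t, β_{10} is the fixed (average) effect of week, r_{1i} is person i's deviation from that average rate of change, and e_{ti} is the within-person residual.

```latex
\begin{aligned}
\text{Level 1:}\quad & Y_{ti} = \pi_{0i} + \pi_{1i}\,\mathrm{week}_{ti} + e_{ti} \\
\text{Level 2:}\quad & \pi_{0i} = \beta_{00} + r_{0i}, \qquad \pi_{1i} = \beta_{10} + r_{1i} \\[4pt]
\text{Monotonic stability:}\quad & \operatorname{Var}(r_{1i}) = 0 \quad (\beta_{10}\ \text{may differ from } 0) \\
\text{Strict stability:}\quad & \beta_{10} = 0 \ \text{ and } \ \operatorname{Var}(r_{1i}) = 0
\end{aligned}
```

Under this reading, a nonzero β_{10} with no random effect of week shifts everyone by the same amount and so preserves rank order, whereas a random effect of week means individuals change at different rates. The additional requirement stated above for strict stability, that no unsystematic change occurs, is not fully captured by these two conditions.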

Based on the theory regarding information-based and

experience-based cues by Koriat et al. (2008) and the

results from previous studies, we hypothesized that

1. The confidence level will show monotonic stability

over time. We did not expect strict stability, since

confidence level is dependent on the correctness of the

answers, which might change over time.

2. The absolute measures of realism (i.e., bias and

calibration) will show strict stability over time. Since


we did not find any convincing reason why the

absolute confidence realism measures would differ

between the occasions, we did not just expect mono-

tonic stability for these measures.

3. Slope, a relative measure of realism, will not be stable

over time.

4. The adults will show a higher degree of stability in

confidence and the two absolute metamemory mea-

sures than the two child groups. Two general reasons

for expecting higher stability for the adults are that

children may be more driven by experience-based cues

than adults and that children may have less stable

cognitive systems (see, e.g., Knutsson et al. 2011).

Method

Participants

A total of 32 children aged 8–9 years (younger children,

median age = 9), 31 children aged 10–11 years (older children,

median age = 11), and 30 adults (median age = 23, ranging from 19

to 31 years of age) participated in the study. Four partici-

pants (one older child and three adults) did not complete all

required parts of the study. Three of the participants

completed only two of the three sets of questions, and one

participant completed only one of the three sets of questions.

The incomplete data were analyzed using Little’s

MCAR test and found to be missing completely at random.

Therefore, the missing data for these four participants

were imputed with an EM algorithm.

Design

The three groups each participated on four occasions,

8 days apart, three of which were retrieval occasions.

Three sets of 20 questions were counterbalanced across

participants and retrieval occasions. No ordering effect was

found on any of the measures. A pilot study found that the

degree of difficulty was equal across the three sets. Also, a

repeated measures ANOVA of the data in this study found

that there was no significant difference in difficulty

between the three sets of questions (F(2, 184) = .30,

p = .738).

Materials

Film clip

The participants watched a film clip (3 min and 4 s long, in

color) depicting a street crossing at a bus terminal in the

center of Lund, Sweden. The film clip showed a street

scene, including shops and other buildings, with people and

vehicles passing. Given that witnesses sometimes give

testimony simply about the presence of, for example,

people or cars in specific settings (locations at specific

times), this film was judged to be relevant to our research.

Questionnaire

Three sets of questions, each with 20 different two-alter-

native forced-choice questions about the content of the

film, were used. The questions about the film’s content

pertained to aspects such as actions, persons, and physical

objects. Different questions were used at each measure-

ment point in order to, as far as possible given the design of

the present study, avoid confounding the effect of stability

with repeated questions. For each question, one of the two

answer alternatives was always correct. An example of a

question is: ‘‘What did the boy with the skateboard have in

his hand: (a) a bottle or (b) a plastic bag?’’ For each

question, the participant rated his or her level of confidence

on a scale ranging from 50 % (‘‘guessing’’) to 100 % (‘‘I’m

absolutely sure that my answer is correct’’), with 10 %

increments rendering a total of six confidence classes.

Procedure

The children participating in the study were recruited at a

school located in the far south of Sweden. The children’s

parents received information about the study via a letter

and returned a response form stating whether or not they

gave their consent for their child to take part in the study.

The adult participants were students at Lund University,

Sweden. When recruited, the participants were told they

would watch a short film clip and then answer questions

about the content of the film on three occasions.

First, the participants watched the film clip. Eight days

after watching the film clip, the participants took part in the

first of the three paper-and-pencil tests. On each test

occasion, the participants completed one of the three sets of

20 questions and made confidence judgments. The partic-

ipants were tested in groups of 2–3 individuals and could

not see each other’s responses as they were seated back to

back.

On the first occasion, the experimenter used one very

easy and one very difficult practice question to inform the

participants about the confidence scale. As predicted, all

children made the appropriate confidence responses to both

the easy and the difficult question. In this context, we

informed the children that the 50 % confidence level cor-

responded to guessing. The children were also told that if

they were neither guessing nor absolutely certain, they

were free to choose an appropriate confidence among the

remaining four confidence levels (60, 70, 80, 90 %).


In previous research, this type of scale has been successfully used with children as young as 8 years (Allwood

et al. 2008).

The adult participants were informed that the instruc-

tions and questions had been designed to suit young chil-

dren who also participated in the study. When informing

the adult participants about how to answer the questions

and to make confidence judgments, we used practice

questions more suited to adults. Each participant was asked

not to discuss the film or the questions about its content

with other persons, regardless of whether those persons

participated in the study.

The participating children’s school was awarded

approximately 460 USD, and each of the adult participants

was awarded approximately 15 USD for partaking in the

study.

Results

Preliminary analyses found no significant difference

between the sets of questions for any of the dependent

measures (proportion correct, confidence, bias, calibration,

and slope). Reliability analyses were carried out for the raw

scores: proportion correct and confidence. Due to negative

inter-item covariance for proportion correct, the assumptions
for reliability testing were not met and no alpha values
can be reported for the three sets of questions. This may have
several causes; for example, the binary nature of the data
may have restricted the spread of the scores. For
confidence, Cronbach’s alpha was .93, .96, and .96 for the

three sets of questions.
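
To illustrate why negative inter-item covariance prevents a meaningful alpha value, the following minimal sketch computes Cronbach's alpha from its standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of the total score). The function name `cronbach_alpha` and both data sets are hypothetical and are not taken from the present study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-participant lists (one value per item)."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical confidence ratings: participants keep their rank order across items.
confidence = [
    [0.5, 0.6, 0.5],
    [0.7, 0.7, 0.8],
    [0.9, 1.0, 0.9],
    [0.6, 0.6, 0.7],
]

# Hypothetical 0/1 correctness scores whose items covary negatively on average.
correct = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]

alpha_confidence = cronbach_alpha(confidence)   # high, positive
alpha_correct = cronbach_alpha(correct)         # negative, not interpretable
```

In the second data set the summed item variances exceed the variance of the total score, so the formula returns a negative value, which cannot be interpreted as a reliability coefficient.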

Independent variables and scoring procedure

for dependent measures

The independent time variable, week, was centered at the

first measurement point (i.e., the first retrieval occasion),

which was used as the reference point in the analysis (week

1). Because of the short delay between the measurement

occasions, the ages of the participants did not change

during the study and therefore age group was treated as a

constant between-person (Level 2) variable. The between-

person (Level 2) variable, sex, was also evaluated.

In addition to the proportion of correct answers and

confidence, we used several metacognitive outcome mea-

sures following the recommendations of Schraw (2009).

More specifically, three such measures were assessed: bias,

calibration, and slope. Bias is calculated by subtracting a

person’s proportion of correct answers from their average

confidence judgment. A value near zero indicates that the

person is neither over- nor underconfident. Calibration is

calculated using the following formula:

$$\mathrm{Calibration} = \frac{1}{n}\sum_{t=1}^{T} n_t \,(r_{tm} - c_t)^2$$

Here, $n$ is the total number of questions judged, $T$ is the
number of confidence classes used, $n_t$ is the number of
confidence judgments within confidence class $r_t$, $r_{tm}$ is the
mean confidence level in confidence class $r_t$, and $c_t$ is the
percent of correct answers within confidence class $r_t$. For

each confidence class, the percent of correct answers within

that class is subtracted from the mean level of confidence

within that class. This difference between the mean

confidence and percent of correct answers is squared and

multiplied by the number of times the confidence class was

used by the participant. The resulting products for each

confidence class are summed, and the sum is divided by the

total number of questions (e.g., Lichtenstein et al. 1982).
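
The bias and calibration computations described above can be sketched as follows. This is a minimal illustration, not the scoring script used in the study: the function names and the ten example answers are hypothetical, values are expressed as proportions (0.5 to 1.0) rather than percentages, and, because the scale is discrete, the mean confidence within a class is assumed to equal the class value itself.

```python
from collections import defaultdict


def bias(confidences, correct):
    """Mean confidence minus proportion correct (positive = overconfidence)."""
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


def calibration(confidences, correct):
    """(1/n) * sum over confidence classes of n_t * (r_tm - c_t)^2."""
    by_class = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        by_class[conf].append(ok)           # group outcomes by confidence class
    n = len(confidences)
    total = 0.0
    for r_tm, outcomes in by_class.items():
        n_t = len(outcomes)                 # times this class was used
        c_t = sum(outcomes) / n_t           # proportion correct in this class
        total += n_t * (r_tm - c_t) ** 2
    return total / n


# Hypothetical answers from one participant (1 = correct, 0 = incorrect).
conf = [0.5, 0.5, 0.6, 0.6, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]
ok = [1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

participant_bias = bias(conf, ok)                 # 0.78 - 0.70 = 0.08
participant_calibration = calibration(conf, ok)   # 0.35 / 10 = 0.035
```

A calibration value of zero would mean that, within every confidence class used, the proportion of correct answers equals the stated confidence.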

The third measure was slope, which indicates how well

the participants differentiated between correct and incor-

rect answers. Slope is calculated by subtracting the average

confidence for incorrect items from the average confidence

for correct items. If a person has low confidence for

incorrect answers and high confidence for correct answers,

it would indicate good separation and, thus, a good slope.

In the present study, the confidence scale ranged from 50 to

100 %. Thus, a slope value of .5 indicates perfect separa-

tion, 0 indicates no separation at all, and -.5 indicates

faulty separation (high-confidence judgments for incorrect

items and low-confidence judgments for correct items).
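
The slope computation and its three reference values can be illustrated with a short sketch. The function name and the example data are hypothetical; returning `None` when a participant has no correct or no incorrect answers is our assumption, since the text does not state how such cases were handled.

```python
def slope(confidences, correct):
    """Mean confidence for correct answers minus mean confidence for incorrect ones."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        return None  # assumption: slope is undefined without both answer types
    return sum(right) / len(right) - sum(wrong) / len(wrong)


outcomes = [1, 1, 0, 0]  # two correct and two incorrect answers

perfect = slope([1.0, 1.0, 0.5, 0.5], outcomes)   # full confidence only when correct
none = slope([0.7, 0.7, 0.7, 0.7], outcomes)      # same confidence everywhere
faulty = slope([0.5, 0.5, 1.0, 1.0], outcomes)    # full confidence only when wrong
```

On the 50 to 100 % scale used here, these three patterns yield .5, 0, and -.5, respectively.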

Analyses

To test the first three hypotheses, data were modelled using

the mixed procedure in the SPSS software. An uncondi-

tional means model, in which time is not added as an

independent variable, was estimated and intraclass corre-

lation (ICC) calculated. The ICC is the proportion of

between-person variance of the total variance (both the

between- and the within-person variance) and is calculated

by taking the between-person variance and dividing it by

the sum of the between- and within-person variance. The

ICC is thus also interpreted as the within-person correlation

of the dependent variable across measurement points since

the higher the proportion of the between-person variance,

the lower the within-person variance between measurement

points (Quene and van den Bergh 2004). Thus, a high ICC

indicates high intra-individual stability. These model esti-

mations and ICC calculations were done for each of the

five dependent measures: proportion of correct memory

reports, confidence, bias, calibration, and slope. After the

unconditional model and the ICC calculations were made,

the time models were estimated using p-values to evaluate

the significance of fixed effects. The fit of the random


effects model was evaluated using the χ² test to determine

the significance of the different models’ -2 restricted log-

likelihood (REML) values. To further investigate the sta-

bility of the dependent measures, the individual trajectories

were assessed graphically, as a failure to fit a model

showing that time predicts change does not necessarily

mean that a dependent measure is stable. There may still

exist non-systematic changes that are not predicted by the

variable of time. Thus, a dependent measure can be

unstable due to two types of changes, namely systematic

(time predicts change with fixed or random effects) or non-

systematic (other known or unknown variables). To

investigate whether age group can explain differences in

patterns of change in the measured variable across time,

this predictor was added to all models as well. The pre-

dictor sex was added to all models but had no significant

effect and was therefore excluded from the final models.
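
The ICC described above is the ratio of between-person variance to total variance. As a rough illustration, the sketch below estimates the two variance components with a one-way random-effects ANOVA estimator for balanced data. This is a swapped-in technique chosen for simplicity and is not the REML-based mixed model fitted in SPSS; the function name and the example scores are hypothetical.

```python
def icc_oneway(scores):
    """ICC from balanced repeated measures: one list of occasions per person."""
    n = len(scores)                          # number of persons
    k = len(scores[0])                       # number of occasions per person
    grand = sum(sum(p) for p in scores) / (n * k)
    person_means = [sum(p) / k for p in scores]
    ms_between = k * sum((m - grand) ** 2 for m in person_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for p, m in zip(scores, person_means) for x in p
    ) / (n * (k - 1))
    # Truncate at zero so that the between-person variance cannot be negative.
    var_between = max((ms_between - ms_within) / k, 0.0)
    var_within = ms_within
    return var_between / (var_between + var_within)


# Hypothetical persons who differ from each other but never change over occasions.
stable = [[0.6, 0.6, 0.6], [0.7, 0.7, 0.7], [0.8, 0.8, 0.8]]
# Hypothetical persons with identical means but large changes between occasions.
unstable = [[0.5, 0.7, 0.9], [0.9, 0.5, 0.7], [0.7, 0.9, 0.5]]

icc_stable = icc_oneway(stable)       # close to 1: all variance is between persons
icc_unstable = icc_oneway(unstable)   # 0: all variance is within persons
```

The two extreme cases show why a high ICC is read as high intra-individual stability and a low ICC as low stability.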

To test the fourth hypothesis regarding differences in the

degree of stability between the different age groups, a

range value was calculated for each individual by taking

his or her maximum score of the three measurement points

for a measure and subtracting the individual’s minimum

score for the same measure. One-way ANOVAs were

conducted for each dependent measure using the between-

groups variable of age group. ANOVAs were used since

differences in degree of stability between age groups can-

not be investigated using MLM.
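
The range-value approach can be sketched as follows: each individual's range is the maximum minus the minimum score over the three occasions, and the ranges are then compared between age groups with a one-way ANOVA. The function names and all scores below are hypothetical, and the sketch only computes the F statistic and its degrees of freedom; obtaining a p-value would require an F distribution from a statistics package.

```python
def range_value(scores):
    """Maximum minus minimum score over the measurement occasions."""
    return max(scores) - min(scores)


def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way between-groups ANOVA."""
    all_values = [v for g in groups for v in g]
    grand = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical calibration scores: three individuals per group, three occasions each.
scores_by_group = {
    "adults": [[0.04, 0.05, 0.06], [0.03, 0.03, 0.05], [0.05, 0.04, 0.04]],
    "older children": [[0.08, 0.05, 0.09], [0.07, 0.10, 0.06], [0.09, 0.08, 0.05]],
    "younger children": [[0.05, 0.15, 0.10], [0.20, 0.08, 0.12], [0.06, 0.18, 0.09]],
}

ranges = {
    group: [range_value(person) for person in people]
    for group, people in scores_by_group.items()
}
f_stat, df_between, df_within = one_way_anova_f(list(ranges.values()))
```

A larger mean range in one group than in another corresponds to lower stability in that group.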

Descriptives of the dependent measures are given in

Table 1. The adults were, on average, slightly undercon-

fident across all three measurement points. In contrast, the

older and younger children exhibited overconfidence across

the three measurement points. All three age groups showed

low separation in their confidence judgments as indicated

by the slope measure.

Assessing intra-individual stability

As the Goodman–Kruskal gamma correlation has been

used as a measure of stability across time in earlier studies

(e.g., Mengelkamp and Bannert 2010), it was calculated

between the different measurement points. The overall

Goodman–Kruskal gamma correlations between weeks

1–2, 2–3, and 1–3 for each dependent measure are provided

in Table 2. Corresponding Spearman correlations showed

similar patterns of results and are therefore not reported

here. Estimates and standard deviations for the fitted

models for the proportion of correct answers, confidence,

bias, calibration, and slope are provided in Table 3.
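
For readers unfamiliar with the Goodman–Kruskal gamma, the following minimal sketch computes it between two measurement occasions as (C - D) / (C + D), where C and D are the numbers of concordant and discordant pairs of participants and tied pairs are ignored. The function name and the example scores are hypothetical; returning `None` when every pair is tied is our assumption.

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1      # same ordering on both occasions
            elif product < 0:
                discordant += 1      # ordering reversed between occasions
            # product == 0: the pair is tied on at least one occasion and is skipped
    if concordant + discordant == 0:
        return None                  # assumption: undefined when all pairs are tied
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical mean confidence of five participants on two occasions.
week_1 = [0.6, 0.7, 0.8, 0.9, 0.7]
week_2 = [0.65, 0.7, 0.75, 0.95, 0.6]

gamma_12 = goodman_kruskal_gamma(week_1, week_2)   # 8 concordant, 1 discordant
```

Because gamma depends only on rank order, it captures whether participants keep their relative positions, not whether their absolute scores stay the same.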

Table 1 Mean and standard deviation (SD) values for adults, older children (10–11 years), and younger children (8–9 years) on proportion correct, confidence, bias, calibration, and slope across three time points

                        Week 1          Week 2          Week 3
                        M       SD      M       SD      M       SD
Proportion correct      .589    .107    .589    .107    .578    .116
  Adults                .640    .089    .632    .088    .646    .094
  Older children        .595    .107    .567    .094    .562    .092
  Younger children      .537    .100    .569    .125    .531    .129
Confidence              .672    .111    .667    .128    .663    .125
  Adults                .620    .065    .598    .075    .598    .079
  Older children        .684    .100    .673    .104    .677    .113
Bias                    .080    .155    .076    .176    .082    .180
  Adults                -.022   .093    -.036   .106    -.050   .114
  Older children        .088    .141    .105    .141    .113    .159
Calibration             .078    .065    .075    .073    .077    .065
  Adults                .049    .043    .041    .024    .039    .028
  Older children        .079    .054    .079    .059    .081    .057
Slope                   .027    .064    .021    .061    .018    .052
  Adults                .046    .080    .049    .070    .045    .053
  Older children        .021    .054    .007    .045    .014    .044
  Younger children      .016    .055    .009    .058    -.004   .047

Table 2 Goodman–Kruskal gamma correlations between week 1–2, week 2–3, and week 1–3 for adults, older children (10–11 years), and younger children (8–9 years) for the different dependent variables

                        Week 1–2    Week 2–3    Week 1–3
Proportion correct      .123        -.027       .026
  Adults                -.283       .000        -.231
  Older children        .044        -.071       .048
  Younger children      .318*       -.198       -.186
Confidence              .620***     .668***     .540***
  Adults                .220        .460***     .191
  Older children        .670***     .681***     .563***
  Younger children      .753***     .688***     .611***
Bias                    .382***     .378***     .352***
  Adults                -.010       .256*       .138
  Older children        .218        .300**      .303*
  Younger children      .549***     .279*       .172
Calibration             .387***     .296***     .244***
  Adults                .219        -.242*      -.149
  Older children        .253*       .365*       .303*
  Younger children      .565***     .290*       .159
Slope                   .003        .108        .023
  Adults                -.151       -.071       -.127
  Older children        -.090       .096        -.068
  Younger children      .177        .079        .074

p-values indicate the significance for correlations between weeks
* p < .05, ** p < .01, *** p < .001


Proportion of correct answers

The ICC for the proportion of correct answers for all

memory reports indicated that the proportions correct

correlated poorly across the three different time points.

Accordingly, only 5 % of the variance was the result of

between-person differences in the scores for the proportion

of correct answers. Consequently, 95 % of the variance in

the scores for the proportion of correct memory reports was

due to within-person differences across the three mea-

surement points. No significant fixed linear or random

linear effects of time on the scores for the proportion of

correct memory reports were found, that is, we found no

systematic change in the proportion of correct memory

reports that could be accounted for by the time variable.

When age group was added to the unconditional model as a

predictor, SPSS failed to estimate the group values, prob-

ably because of the small ICC.

Although the failure to fit a model of linear change

should indicate stability, the low ICC indicates that the

intra-individual score for the proportion of correct memory

reports was not stable for the majority of participants. This

indication of poor stability was also supported by graphical

examinations of the trajectories which showed that a

majority of the participants differed considerably between

the three measurement points. In total, the mean variation

in proportion correct answers was 17.5 %.

Confidence

The between-person variance of confidence accounted for

78 % of the total variance, indicating that the participants’

level of confidence was highly correlated between the three

different time points. When estimating the model, no sig-

nificant fixed linear effect of time on confidence scores was

found; the time variable did not account for any systematic

change in confidence on the group level. However, a significant
random linear effect of time was found (REML χ²
difference (2) = 11.515, p < .01). This indicates that there

were significant differences between individuals in the

patterns of linear change observed. To further investigate

this finding, confidence intervals (CI) were calculated for

the random effects. For 95 % of the individuals, the tra-

jectory was predicted to be between -6.72 % units and

5.92 % units, indicating that the level of confidence

declined over time for some individuals and increased for

others. Thus, participants need to be described by different

trajectories. When graphically examining the trajectories, it

was also evident that there were participants that did not

follow these patterns of increase and decrease but rather

fluctuated. The mean variation in confidence between the

measurement points was 8.6 %. When age group was

added as a predictor, a significant main effect of age group

(F(2, 90.01) = 8.48, p < .001) indicated a difference in

confidence scores between all three age groups at the first

Table 3 Longitudinal models for the outcome measures of proportion correct, confidence, bias, calibration and slope for adults, older children (10–11 years), and younger children (8–9 years)

                                 Proportion correct   Confidence         Bias               Calibration        Slope
                                 Est.      SE         Est.      SE       Est.      SE       Est.      SE       Est.    SE
Fixed effects
  Intercept (γ00)                .639***   .011       .611***   .019     -.036     .021     .043***   .009     .046    .006
  Week (γ10)                                          -.004     .005
  Age group (γ01)
    Adults^a                     b         b          0         0        0         0        0         0        b       b
    Older children               b         b          .071**    .026     .138***   .030     .037***   .012     b       b
    Young children               b         b          .106***   .026     .200***   .030     .063***   .012     b       b
Random effects
  Residual var. e_ti             b                    .002***   .000     .013***            .002***            b       b
  Intercept var. U_0i            b                    .011***   .002     .009***            .001***            b       b
  Week variance U_1i                                  .001**    .000
  Int.-week covar. (U_0i, U_1i)                       -.001     .001
REML deviance   - 458.232   -583.040   -295.962   -767.005

Week is centered at week 1. Est. = estimate, SE = standard error, Residual var. = the within-person variance, Intercept var. = the intercept variance, Week var. = the week variance, and Int.-week covar. = the covariance between intercept and week. REML deviance = -2 restricted log likelihood
^a Adults is the reference point of age groups
b SPSS was not able to calculate the estimate
* p < .05, ** p < .01, *** p < .001


measurement point. The younger children scored 10.6 %

units (SE = 2.6 %) higher on the confidence measure than

the adults, and the older children scored 7.1 % units

(SE = 2.6 %) higher than the adults on the same measure.

This result indicates that the different age groups need

different intercepts. The final model was:

$$\text{Level 1: } \mathrm{Confidence}_{ti} = \beta_{0i} + \beta_{1i}(\mathrm{Week}_{ti}) + e_{ti}$$

$$\text{Level 2: } \beta_{0i} = \gamma_{00} + \gamma_{01}(\text{Age group}) + U_{0i}$$

$$\beta_{1i} = \gamma_{10} + U_{1i}$$

In this model, Level 1 indicates the within-person
effects and Level 2 renders the between-person effects. As
seen in the model, confidence is a function of the expected
value of an individual $i$ on the $\beta_{0i}$ intercept, the $\beta_{1i}$
trajectory, which refers to an estimated value of linear
change for an individual as a function of $\mathrm{Week}_{ti}$, and $e_{ti}$,
the within-person residual, whose variance is the
within-person variance. Furthermore, the
expected value for an individual $i$ ($\beta_{0i}$) is a function of the
fixed intercept ($\gamma_{00}$), which is the mean confidence level at
the first measurement point for the reference group (adults),
the age group ($\gamma_{01}$), and the
between-person variation in the intercept ($U_{0i}$). In addition,
$\beta_{1i}$ is a function of the mean (nonsignificant) linear change
in time, that is, the fixed effect of Week ($\gamma_{10}$), and the
variation in change between individuals across weeks, the
Week variance ($U_{1i}$). In brief, the results indicate

instability in the individual participants’ confidence over

time.
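
To make the confidence model concrete, the sketch below plugs the rounded estimates from Table 3 into the fixed part of the model and derives a 95 % range for the individual week trajectories. We assume that this range is the fixed week effect plus or minus 1.96 standard deviations of the random week effect; the text does not spell out the computation, and the function and constant names are hypothetical.

```python
import math

GAMMA_00 = 0.611     # fixed intercept: adults at the first occasion (Table 3)
GAMMA_10 = -0.004    # fixed effect of week (Table 3)
GAMMA_01 = {         # age-group effects relative to adults (Table 3)
    "adults": 0.0,
    "older children": 0.071,
    "younger children": 0.106,
}
WEEK_VARIANCE = 0.001  # variance of the random week effect, rounded (Table 3)


def expected_confidence(age_group, week):
    """Model-implied mean confidence; week 1 is the reference point (coded 0)."""
    return GAMMA_00 + GAMMA_01[age_group] + GAMMA_10 * (week - 1)


def trajectory_range(z=1.96):
    """Assumed 95 % range of individual week slopes: fixed effect +/- z * SD."""
    half_width = z * math.sqrt(WEEK_VARIANCE)
    return GAMMA_10 - half_width, GAMMA_10 + half_width


younger_week_1 = expected_confidence("younger children", 1)   # about .717
low, high = trajectory_range()                                # about -.066 and .058
```

With these rounded inputs the range is roughly -6.6 to 5.8 % units per week, close to the reported -6.72 to 5.92; the small gap presumably reflects rounding of the variance estimate in Table 3.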

Bias

The between-person variance of bias accounted for 54 %

of the total variance. Consequently, the bias measure

moderately correlated across the three measurement points.

While estimating the model, no significant fixed linear or

random linear effect of time on bias measure was found.

When the predictor age was added to the unconditional

model, a significant effect was found (F(2, 90) = 25.00,
p < .001). Younger children (.20, SE = .03) and older
children (.14, SE = .03) were significantly more biased at

the beginning of the study (at the first measurement point)

than adults.

The failure to estimate a time-change model indicates

that the intra-individual stability of the bias measure is

high. This indication was supported when each individual’s

trajectory was examined graphically, which showed that

the participants did not differ much between the three

measurement points. The mean variation of the bias score

was .19 (it should be noted that bias has a larger scale span

with possible values from -1 to 1, unlike the calibration

measure). Therefore, we found evidence that the level of

bias is strictly stable (see also Tisak and Meredith 1990),

that is, it does not differ significantly for individuals over

time.

Calibration

Intraclass correlation was calculated and the between-

person variance of calibration accounted for 45 % of the

total variance. Consequently, calibration moderately cor-

related between the three different time points. No sig-

nificant fixed linear effect or random linear effect of time

on calibration was found. When the predictor age group

was added to the unconditional model, a significant effect

was found (F(2, 90) = 13.48, p < .001). Younger children

(.06, SE = .01) and older children (.04, SE = .01) were

significantly less well calibrated than adults at the begin-

ning of the study (at the first measurement point). The

failure of fitting a time-change model indicates intra-

individual stability in the calibration measure over time.

This finding was also supported by a graphical examina-

tion of each individual’s trajectory, which indicated that a

majority of the participants did not differ much between

the three measurement points. The mean difference in

calibration was .07. Therefore, we found evidence that the

level of calibration is strictly stable, that is, it does not

differ significantly for individuals over time.

Slope

The ICC was calculated and the between-person variance

of the slope measure accounted for 6 % of the total vari-

ance. Consequently, the within-person correlation of slope

for the different measurement points was low. When esti-

mating the model, no fixed linear or random linear effect of

time on the slope measure was found. When age group was

added to the unconditional model as a separate predictor,

SPSS failed to estimate the group values, probably because

of the small ICC. Although fitting a time-change model

failed, the low ICC indicated that the slope measure was

not very stable over time. This finding was supported when

the trajectories were examined graphically, since a major-

ity of participants differed between the three measurement

points. However, the mean variation was only .09, which

can be the result of different trajectories cancelling out

each other. Thus, the findings regarding the slope measure’s

stability are mixed. Although the failure to fit a time model

can be an indication of strict stability, the low ICC indi-

cates that there exists unsystematic change, which also

would lead to a failure to fit a time-change model. This

unsystematic change in the slope measure, as indicated by
the low ICC, suggests that the slope measure tends
to be less stable than the two absolute measures.

Differences in degree of stability between age groups

In order to test hypothesis 4, the average range value for

each age group was calculated as described above. These


values are provided in Table 4. A lower value indicates

lesser range and higher stability across time points. One-

way ANOVAs showed no differences in the range values

between the age groups for the proportion of correct

answers, confidence, and bias. A significant difference in

stability between the age groups was found for calibration

(F(2, 57.802) = 6.164, p = .004, η² = .146). Because the
assumption of homogeneity of variance for calibration was not met, the

Welch formula was used in the ANOVA above and a

Games–Howell post hoc analysis was conducted. The

younger children were significantly less stable than the

adults (p < .01) and the older children (p < .05). A significant
difference in stability was found between the age
groups for slope (F(2, 90) = 3.89, p < .05, η² = .086).
Bonferroni post hoc analysis revealed that the adults
exhibited a significantly less stable slope value than the
younger children (p < .05).

Discussion

In the present study, we investigated the intra-individual

stability of confidence judgments and the realism of those

judgments over time. As a background to this issue, we first

discuss whether the age groups differed in confidence and

the confidence realism measures. However, since adults

were the reference group, the differences found between

the child groups for the confidence, bias, and calibration

measure are only descriptive in nature. Overall, the results

indicated a difference in the age groups in regard to the

level of confidence. The adults were significantly less

confident than the children. The youngest children

appeared to be the most confident age group. We found

slight underconfidence in the adult group and fairly marked

overconfidence in the two groups of children. The adults

were also better calibrated than the older children, and they

in turn were better calibrated than the younger children.

These results are in line with previous research (Allwood

et al. 2008; Knutsson et al. 2011). The results for slope, due

predominantly to small correct–incorrect differences,

indicated that all groups differentiated poorly between

correct and incorrect answers at all time points.

Our first hypothesis that the participants’ confidence

level would be stable across time was not confirmed

because the confidence level significantly increased for

some individuals but decreased for other individuals,

indicating that confidence may not be as stable as earlier

studies have reported. However, it is important to note that

the context investigated in the present study differs from

the contexts that have been investigated in previous

research. The present study may be the first to investigate

the stability of confidence and measures of confidence

realism when individuals answer questions on different

occasions about different details of the same event. The

fact that the participants answered different questions at the

three occasions may have led to the differences in confi-

dence level found in this study. Further research using the

present methodology should investigate whether this find-

ing generalizes to other contexts for confidence judgments.

If valid, the finding that confidence shows instability

over time in the context analyzed in the present study is of

interest and would probably not have been detected if the

analysis had been carried out only on a group level as in

previous research; the individuals whose confidence level

increased with time would have cancelled out those whose

confidence level decreased with time. In general, stability

is a question of degree and our results show that confidence

may be somewhat less stable over time than reported in

previous research. A task for future research is to investi-

gate the extent to which the confidence stability found by

Jonsson and Allwood (2003) depends on the differences in

the memory tasks used (word knowledge and logical/spatial
ability) and/or the different types of statistical analyses
employed. Yet another task for future research may be to

investigate inter-individual differences in the stability of

confidence judgments in the context of different personal-

ity traits. By incorporating measures of personality traits

into the multilevel modelling, it might be possible to

explain why some persons’ confidence increases while
others’ decreases over time.

Speculatively, the reason why we did not find monotonic
stability for confidence could be that confidence is partly
based on experience-based cues and that, in this paradigm,
where different questions were asked on each occasion,
these cues might have led to differences in the
rank-order correlation.

In line with our second hypothesis predicting that bias

and calibration would exhibit strict stability over time, our

results did show strict intra-individual stability for the

absolute measures of confidence realism. On a general

level, these results are in line with previous studies

Table 4 Mean and standard deviation (SD) of the range values for each measure for the adults, older children (10–11 years), and younger children (8–9 years)

Measure               Adults          Older children    Younger children
                      M       SD      M       SD        M       SD
Proportion correct    .163    .088    .164    .082      .198    .137
Confidence            .093    .049    .075    .042      .096    .084
Bias                  .167    .085    .189    .103      .214    .145
Calibration           .050    .039    .065    .044      .104    .076
Slope                 .116    .087    .080    .046      .075    .047

The range value was calculated for each individual by taking his or her maximum score of the three measurement points for a measure and subtracting the individual’s minimum score on the same measure


reviewed above (e.g., Mengelkamp and Bannert 2010).

However, our finding of strict stability for the absolute

confidence measures adds to the previous research in this

context since earlier studies only investigated monotonic

stability, not strict stability. Assuming that our findings of

strict stability hold up in future research, strict stability for

absolute confidence realism may be somewhat more

directly useful for justice professionals than a finding of

monotonic stability would have been. In addition, our study

differed from the Mengelkamp and Bannert (2010) study in

that we measured the groups at three different time points,

meaning that we were able to explore whether any linear or

quadratic trends exist in the metacognitive measures (no

such trends were found). The stability of the absolute

measures is in line with our suggestion that information-

based cues might play a larger role in these measures than

experience-based cues (Koriat et al. 2008). As described

earlier, information-based cues might involve conceptions

(or ‘‘theories’’) about one’s own competences relevant for

the task and cues concerning such conceptions might be

more stable over time compared with experience-based

cues such as fluency that have been suggested to show

variability over time (Leonesio and Nelson 1990).

In the Introduction, we suggested that the stability of

bias and calibration may be enhanced by the fact that they

may inherit the degree of stability evident in the confidence

judgments. The reason for this expectation was that con-

fidence is a component of the bias and calibration mea-

sures, and that confidence on the basis of previous research

was expected to show stability. However, in contrast to

earlier research that utilized less advanced statistical

methods, we found fairly poor stability in confidence

judgments, which suggests that quasi-inheritance of sta-

bility from the confidence judgments may not be a good

explanation for the stability found in the absolute measures

in the present study. Similarly, the non-stability of pro-

portion correct answers shows that inherited variance from

proportion correct answers also does not help explain the

stability of bias and calibration.

Finally, in the context of our second hypothesis, we note

that the strict stability of realism identified among the

children for bias was not associated with good realism

considering that, on average, both the older and younger

children were overconfident. Thus, over time, the bias

shown by the three groups did not lessen or increase.

The third hypothesis was that the relative measure of

realism of confidence, that is, slope, would not show sta-

bility. In line with most previous research (e.g., Mengelk-

amp and Bannert 2010), we found some support for this

hypothesis since the ICC was low for the slope measures.

The participants’ performance with respect to slope is not

under the participants’ direct control to the same extent as

the absolute confidence accuracy measures, and this may

have promoted instability. To elaborate, improving one’s

general bias might simply involve adjusting one’s general

level of confidence (e.g., simply decreasing one’s confi-

dence if overconfidence is at hand), whereas improving

one’s slope may involve identifying when one’s answer is

correct or incorrect (for example, by making better use of

information to determine whether the answer is correct) by

using different types of cues considered relevant to the

task, and then lowering or increasing one’s confidence as

appropriate after considering these cues (Yates 1990). This

variability in possible cues may contribute to a lesser

degree of stability in slope compared to the measures of

absolute confidence realism. In addition, the described lack

of stability in slope corresponds to experience-based cues

playing a more important role for relative measures,

including slope, compared to more constant information-

based cues (Koriat et al. 2008). Experience-based cues,

such as fluency, have been suggested to show variability

over time (Leonesio and Nelson 1990), which may have

contributed to the instability of slope. Therefore, the reason

why we did not find monotonic stability for slope could be

that the slope is more driven by experience-based cues.

However, this is only a speculation and further research is

needed on this issue.

Our fourth hypothesis asserted that the adults would

show a higher degree of stability than the two groups of

children for confidence and realism. This hypothesis was

confirmed for calibration. The possibility that adults had an

ability to base their confidence judgments more on theory-

based cues (i.e., information-based cues) than children

(Koriat et al. 2008) may help explain this result. However,

no significant difference was found in the degree of sta-

bility between the adults and older children for bias, and

surprisingly, younger children were found to be less

unstable than adults in regard to slope. We do not have an

explanation for the last result, but interestingly the results

for slope showed very poor performance for all groups,

especially the children. The greater instability for adults

may be explained by a floor effect for the children.

Finally, when comparing the results for the Goodman–

Kruskal gamma correlations for our adult subsample across

the three time points to the Pearson correlations reported by

Jonsson and Allwood (2003) for their 18-year-old partici-

pants, we mostly found differences. First, we found low or

average correlations between the levels of confidence at the

three measurement points, whereas Jonsson and Allwood

(2003) found high correlations. Second, we found low

correlations for bias and even some negative correlations

for calibration between the three measurements, whereas

Jonsson and Allwood (2003) found medium or somewhat

high correlations for both these measures. However, a

further interesting difference can be noted. Though the

proportion of correct answers did not correlate between any


of the time points in our study and even correlated (non-

significantly) negatively, these correlations were fairly high

in the results of the Jonsson and Allwood (2003) study.

This difference in results might help to explain the differ-

ences in the other correlations between the studies, and

speculatively, the difference itself could possibly be

explained by the fact that the type of knowledge measured

by Jonsson and Allwood (2003) (word knowledge and

logical/spatial ability) might be more available in memory

than the fairly neutral events in the film clip used in the

present study.

Future research should investigate more thoroughly the

extent to which differences in the type of retrieved mem-

ories contribute differently to stability over time for con-

fidence and various metacognitive measures. For example,

Perfect (2002) argued that important differences exist

between episodic and semantic memory with respect to

confidence realism. Such differences might render different

results for metacognitive stability for the different types of

memory material.

As described above, different questions were used at the

three time points. This was done in order to avoid the

testing effect, the effect that active repetition tends to

improve the correctness of later recall (Roediger and

Karpicke 2006a, b), and to avoid the influence of the

reiteration effect (Hertwig et al. 1997), an increase in a

person’s confidence in an assertion when they repeat it

(Odinot et al. 2009). Instead, we were interested in inves-

tigating how stable metamemory realism is across time

when the testing and reiteration effects are controlled.

However, it should be noted that the repetition effect may

not have been completely avoided since memories from the

same witnessed event were accessed on all three occasions.

Still, the results from this study have implications with

respect to the effects of the point in time after witnessing a

crime that the interview is administered (e.g., in close

proximity to when a crime was witnessed or some time

later) on metamemory realism.

Interestingly, the questions showed negative inter-item

covariance when it came to proportion correct. This is

usually the case when items have been incorrectly coded,

which is not the case in this study. However, it could be

that the items’ difficulty level was not consistent, violating

the assumption of equal error variance that is required for

reliability testing (Sijtsma 2009). What is interesting is the

high reliability values found for confidence, supporting the

notion of a confidence trait suggested by researchers such

as Kleitman and Stankov (2007).
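
To illustrate why negative inter-item covariance is a problem for reliability testing, the following minimal sketch assumes that the reliability coefficient in question is Cronbach's alpha (the coefficient discussed by Sijtsma 2009). The function name and the item scores are hypothetical and are not taken from the present study.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    items: list of k lists, each holding one score per participant.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    item_variance_sum = sum(variance(scores) for scores in items)
    totals = [sum(scores) for scores in zip(*items)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical correctness scores (1 = correct) of four participants.
item_a = [1, 0, 1, 0]
negatively_related = [0, 1, 0, 0]  # tends to be correct when item_a is not
positively_related = [1, 0, 1, 1]  # tends to be correct when item_a is

alpha_negative = cronbach_alpha([item_a, negatively_related])
alpha_positive = cronbach_alpha([item_a, positively_related])
```

When items covary negatively, the variance of the total score is smaller than the sum of the item variances, so alpha drops below zero and can no longer be read as a reliability estimate; with positively covarying items the same formula gives a positive value.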

Some limitations of the present study should be noted.

First, similar to previous studies on confidence stability,

our memory task was a recognition task, and thus, the

stability found for some of the measures in the present

study may not generalize to an open free recall task or

to a memory task involving the recall of answers to

directed open questions. Moreover, the memory ques-

tions were new at each of the three occasions; thus, our

results may not generalize to situations where the same

questions are repeated on many occasions. Therefore, the

results from this study need to be replicated, preferably

with a larger sample and also with other types of

memory tasks and situations. Second, the present study

investigated stability over a time period of only 3 weeks.

In order to increase the relevance of our results for

forensic situations in which witnesses might give their

testimony, for example, half a year, or a year, after they

witnessed an event, it would be interesting to study

stability in confidence and metamemory realism over

longer time periods. Finally, it is not clear why cor-

rectness showed poor stability over the three measure-

ments in our study. To some extent, the use of different

knowledge questions at each of the three measurement

points (although these were controlled for level of dif-

ficulty) may have contributed to the lack of stability in

correctness. However, further analyses regarding the

effect of the order of the sets of questions did not reveal

any significant differences for the proportion of correct

answers, confidence level, or any of the realism measures

at any of the different time points.

Acknowledgments This study was partially funded by the Crime

Victim Compensation and Support Authority and partially by the

Swedish Research Council (VR) with grants to the second author.

References

Allwood CM (2010) Eyewitness confidence. In: Granhag PA (ed)

Forensic psychology in context. Willan Publishing, Devon,

pp 281–303

Allwood CM, Ask K, Granhag PA (2005) The cognitive interview:

effects on the realism in witnesses’ confidence in their free

recall. Psychol Crime Law 11:183–198. doi:10.1080/10683160

512331329943

Allwood CM, Granhag PA, Jonsson AC (2006) Child witnesses’

metamemory realism. Scand J Psychol 47:461–470. doi:

10.1111/j.1467-9450.2006.00530.x

Allwood CM, Innes-Ker AH, Holmgren J, Fredin G (2008) Children’s

and adults’ realism in their event-recall confidence in responses

to free recall and focused questions. Psychol Crime Law

14:529–547. doi:10.1080/10683160801961231

Asendorpf JB (1989) Individual, differential, and aggregate stability

of social competence. In: Schneider BH, Attili G, Nadel J,

Weissberg R (eds) Social competence in developmental per-

spective. Kluwer Academic Publishers, Dordrecht, pp 71–86

Benjamin AS, Diaz M (2008) Measurement of relative metamne-

monic accuracy. In: Dunlosky J, Bjork RA (eds) Handbook of

metamemory and memory. Psychology Press, New York,

pp 73–94

Bornstein BH, Zickafoose DJ (1999) ‘‘I know I know it, I know I saw

it’’: the stability of the confidence–accuracy relationship across

domains. J Exp Psychol Appl 5:76–88. doi:10.1037/1076-898X.

5.1.76

Boyce M, Beaudry JL, Lindsay RCL (2007) Belief of eyewitness

identification evidence. In: Lindsay RC, Ross DF, Read JD,

Toglia MP (eds) Handbook of eyewitness psychology, vol 2:

Memory for people. Lawrence Erlbaum Associates, Mahwah,

pp 501–525

Brewer N (2006) Uses and abuses of eyewitness identification

confidence. Leg Criminol Psychol 11:3–23. doi:10.1348/135532

505X79672

Cutler BL, Penrod SD, Stuve TE (1988) Juror decision making in

eyewitness identification cases. Law Hum Behav 12:41–55. doi:

10.1007/BF01064273

Duncan TE, Duncan SC, Strycker LA (2006) An introduction to latent

variable growth curve modelling: concepts, issues, and applica-

tions, 2nd edn. Lawrence Erlbaum Associates, Mahwah

Gabbert F, Memon A, Allan K (2003) Memory conformity: can

eyewitnesses influence each other’s memories for an event? Appl

Cogn Psychol 17:533–543. doi:10.1002/acp.885

Hertwig R, Gigerenzer G, Hoffrage U (1997) The reiteration effect in

hindsight bias. Psychol Rev 104:194–202. doi:10.1037/0033-

295X.104.1.194

Howie P, Roebers CM (2007) Developmental progression in the

confidence-accuracy relationship in event recall: insights pro-

vided by a calibration perspective. Appl Cogn Psychol

21:871–893. doi:10.1002/acp.1302

Jonsson A-C, Allwood CM (2003) Stability and variability in the

realism of confidence judgments over time, content domain, and

gender. Pers Individ Differ 34:559–574. doi:10.1016/S0191-

8869(02)00028-4

Juslin P, Olsson H, Winman A (1996) Calibration and diagnosticity of

confidence in eyewitness identification: comments on what

cannot be inferred from a low confidence-accuracy correlation.

J Exp Psychol Learn Mem Cogn 22:1304–1316. doi:

10.1037/0278-7393.22.5.1304

Kelemen WL, Frost PJ, Weaver CA III (2000) Individual differences

in metacognition: evidence against a general metacognitive

ability. Mem Cogn 28:92–107. doi:10.3758/BF03211579

Kelley CM, Lindsay DS (1993) Remembering mistaken for knowing:

ease of retrieval as a basis for confidence in answers to general

knowledge questions. J Mem Lang 32:1–24. doi:10.1006/

jmla.1993.1001

Kleitman S (2008) Metacognition in the rationality debate. Self-

confidence and its calibration. VDM Verlag Dr. Mueller,

Germany

Kleitman S, Stankov L (2007) Self-confidence and metacognitive

processes. Learn Individ Differ 17:161–173. doi:10.1016/j.lindif.

2007.03.004

Knutsson J, Allwood CM, Johansson M (2011) Child and adult

witnesses: the effect of repetition and invitation-probes on free

recall and metamemory realism. Metacogn Learn 6:213–228.

doi:10.1007/s11409-011-9071-y

Koriat A (1993) How do we know that we know? The accessibility

model of the feeling of knowing. Psychol Rev 100:609–639. doi:

10.1037/0033-295X.100.4.609

Koriat A (1994) Memory’s knowledge of its own knowledge: The

accessibility account of the feeling of knowing. In: Metcalfe J,

Shimamura AP (eds) Metacognition: knowing about knowing.

MIT Press, Cambridge MA, pp 115–135

Koriat A, Nussinson R, Bless H, Shaked N (2008) Information-based

and experience-based metacognitive judgments: Evidence from

subjective confidence. In: Dunlosky J, Bjork RA (eds) A

handbook of memory and metamemory. Lawrence Erlbaum

Associates, Mahwah, pp 117–136

Kuhn D, Dean D (2004) A bridge between cognitive psychology and

educational practice. Theory Pract 43:268–273

Kwok O, Underhill AT, Berry JW, Luo W, Elliott TR, Yoon M

(2008) Analyzing longitudinal data with multilevel models: an

example with individuals living with lower extremity intra-

articular fractures. Rehabil Psychol 53:370–386. doi:10.1037/

a0012765

Lane SM, Mather M, Villa D, Morita SK (2001) How events are

reviewed matters: effects of varied focus on eyewitness

suggestibility. Mem Cogn 29:940–947. doi:10.3758/BF03195

756

Leippe MR, Eisenstadt D (2007) Eyewitness confidence and the

confidence accuracy relationship in memory for people. In:

Lindsay RC, Ross DF, Read JD, Toglia MP (eds) Handbook of

eyewitness psychology, vol 2: Memory for people. Lawrence

Erlbaum Associates, Mahwah, pp 377–425

Leonesio RJ, Nelson TO (1990) Do different metamemory judgments

tap the same underlying aspects of memory? J Exp Psychol

Learn Mem Cogn 16:464–470. doi:10.1037/0278-7393.16.3.464

Lichtenstein S, Fischhoff B (1977) Do those who know more also

know more about how much they know? Organ Behav Hum

Perform 20:159–183. doi:10.1016/0030-5073(77)90001-0

Lichtenstein S, Fischhoff B, Phillips LD (1982) Calibration of

probabilities: the state of the art of 1980. In: Kahneman D,

Slovic P, Tversky A (eds) Judgment under uncertainty: heuristics

and biases. Cambridge University Press, Cambridge, pp 306–334

Lindsay RCL, Wells GL, Rumpel CM (1981) Can people detect

eyewitness identification accuracy within and across situations?

J Appl Psychol 66:79–89. doi:10.1037/0021-9010.66.1.79

Marsh EJ, Tversky B, Hutson M (2005) How eyewitnesses talk about

events: implications for memory. Appl Cogn Psychol

19:531–544. doi:10.1002/acp.1095

Mengelkamp C, Bannert M (2010) Accuracy of confidence judg-

ments: stability and generality in the learning process and

predictive validity for learning outcome. Mem Cogn

38:441–451. doi:10.3758/MC.38.4.441

Nelson TO (1984) A comparison of current measures of the accuracy

of feeling-of-knowing predictions. Psychol Bull 95:109–133.

doi:10.1037/0033-2909.95.1.109

Nelson TO (1996) Gamma is a measure of the accuracy of predicting

performance on one item relative to another item, not of the

absolute performance on an individual item. Appl Cogn Psychol

10:257–260. doi:10.1002/(SICI)1099-0720(199606)10:3

Odinot G, Wolters G, Lavender T (2009) Repeated partial eyewitness

questioning causes confidence inflation but not retrieval-induced

forgetting. Appl Cogn Psychol 23:90–97. doi:10.1002/acp.1443

Paterson HM, Kemp RI (2006) Co-witness talk: a survey of

eyewitnesses’ discussion. Psychol Crime Law 12:181–192. doi:

10.1080/10683160512331316334

Perfect TJ (2002) When does eyewitness confidence predict perfor-

mance? In: Perfect TJ, Schwartz B (eds) Applied metacognition.

Cambridge University Press, Cambridge, pp 95–120

Quene H, van den Bergh H (2004) On multi-level modeling of data

from repeated measures designs: a tutorial. Speech Commun

43:103–121. doi:10.1016/j.specom.2004.02.004

Raudenbush SW (2001) Comparing personal trajectories and drawing

causal inferences from longitudinal data. Annu Rev Psychol

52:501–525. doi:10.1146/annurev.psych.52.1.501

Roediger HL III, Karpicke JD (2006a) Test-enhanced learning.

Taking memory tests improves long-term retention. Psychol Sci

17:249–256. doi:10.1111/j.1467-9280.2006.01693.x

Roediger HL III, Karpicke JD (2006b) The power of testing memory.

Basic research and implications for educational practice. Perspect

Psychol Sci 1:181–210. doi:10.1111/j.1745-6916.2006.00012.x

Schlottmann A, Anderson NH (1994) Children’s judgments of

expected value. Dev Psychol 30:56–66. doi:10.1037/0012-

1649.30.1.56

Schneider W (2008) The development of metacognitive knowledge in

children and adolescents: major trends and implications for

education. Mind Brain Educ 2:114–121

Schneider W, Lockl K (2002) The development of metacognitive

knowledge in children and adolescents. In: Perfect T, Schwartz

B (eds) Applied metacognition. Cambridge University Press,

Cambridge, pp 224–257

Schraw G (2009) Measuring metacognitive judgments. In: Hacker DJ,

Dunlosky J, Graesser AC (eds) Handbook of metacognition in

education. Routledge, New York, pp 415–429

Sijtsma K (2009) On the use, the misuse, and the very limited

usefulness of Cronbach’s alpha. Psychometrika 74:107–120. doi:

10.1007/s11336-008-9101-0

Sporer SL, Penrod S, Read D, Cutler B (1995) Choosing, confidence,

and accuracy: a metaanalysis of the confidence-accuracy relation

in eyewitness identification studies. Psychol Bull 118:315–327.

doi:10.1037/0033-2909.118.3.315

Stankov L, Crawford JD (1996) Confidence judgments in studies of

individual differences. Pers Individ Differ 21:971–986. doi:

10.1016/S0191-8869(96)00130-4

Thompson WB, Mason SE (1996) Instability of individual differences

in the association between confidence judgments and memory

performance. Mem Cogn 24:226–234. doi:10.3758/BF03200883

Tisak J, Meredith W (1990) Descriptive and associative development

models. In: Von Eye A (ed) Statistical methods in longitudinal

research, vol 2. Academic Press, Boston, pp 387–406

Wells GL, Bradfield AL (1999) Distortions in eyewitnesses’ recol-

lections: can the postidentification feedback effect be moder-

ated? Psychol Sci 10:138–144. doi:10.1111/1467-9280.00121

Wells GL, Lindsay RCL, Ferguson TJ (1979) Accuracy, confidence,

and juror perceptions in eyewitness identification. J Appl

Psychol 64:440–448. doi:10.1037/0021-9010.64.4.440

Wells GL, Olson EA, Charman SD (2002) The confidence of

eyewitnesses in their identifications from lineups. Curr Dir

Psychol Sci 11:151–154. doi:10.1111/1467-8721.00189

Wells GL, Memon A, Penrod SD (2006) Eyewitness evidence:

improving its probative value. Psychol Sci Publ Int 7:45–74

Yates JF (1990) Judgment and decision making. Prentice Hall,

Englewood Cliffs

Yates JF (1994) Subjective probability accuracy analysis. In: Wright

G, Ayton P (eds) Subjective probability. Wiley, New York,

pp 381–410
