RESEARCH REPORT
Stability in the metamemory realism of eyewitness confidence judgments
Sandra Buratti • Carl Martin Allwood •
Marcus Johansson
Received: 22 October 2012 / Accepted: 18 June 2013
© Marta Olivetti Belardinelli and Springer-Verlag Berlin Heidelberg 2013
Abstract The stability over time of eyewitnesses’ confidence
judgments about their reported memories, and of the
accuracy of these judgments, is of interest in forensic
contexts because witnesses are often interviewed many
times. The present study investigated the stability of the
confidence judgments of memory reports of a witnessed
event and of the accuracy of these judgments over three
occasions, each separated by 1 week. Three age groups
were studied: younger children (8–9 years), older children
(10–11 years), and adults (19–31 years). A total of 93
participants viewed a short film clip and were asked to
answer directed two-alternative forced-choice questions
about the film clip and to give a confidence judgment for each answer.
Different questions about details in the film clip were used
on each of the three test occasions. Confidence as such did
not exhibit stability over time on an individual basis.
However, the difference between confidence and propor-
tion correct did exhibit stability across time, in terms of
both over/underconfidence and calibration. With respect to
age, the adults and older children exhibited more stability
than the younger children for calibration. Furthermore,
some support for instability was found with respect to the
difference between the average confidence level for correct
and incorrect answers (slope). Unexpectedly, however, the
younger children’s slope was found to be more stable than
the adults’. Compared to previous research, the present
study’s use of more advanced statistical methods provides
a more nuanced understanding of the stability of confi-
dence judgments in the eyewitness reports of children and
adults.
Keywords Confidence · Confidence accuracy · Realism of confidence · Calibration · Stability · Eyewitness memory · Metamemory
Introduction
The level and accuracy of the confidence expressed by
eyewitnesses with respect to their testimony are important
because their reported confidence levels are often used by
jurors and other parties in the forensic context to assess the
reliability of their memory (Boyce et al. 2007; Cutler et al.
1988; Lindsay et al. 1981; Wells and Bradfield 1999; Wells
et al. 1979). However, research has shown that the accuracy
of confidence judgments is often lacking for both event
memory reports (Allwood 2010; Allwood et al. 2005, 2008;
Leippe and Eisenstadt 2007) and line-up identifications
(Brewer 2006; Sporer et al. 1995). Eyewitnesses to a crime
often talk about their memory of the event many times: with
the police, family, and friends, as well as in court (Gabbert
et al. 2003; Lane et al. 2001; Marsh et al. 2005; Paterson and
Kemp 2006), and this may influence their memories. In
general, memory changes over time, but less is known about
the stability of witnesses’ abilities to provide accurate
confidence judgments concerning the correctness of their
memory over time. The issue of the stability of eyewitness
confidence is relevant in forensic settings because it can
help the police and courts better judge the validity of eye-
witness reports. The accuracy of the eyewitnesses’ confi-
dence in their memories is also referred to as the realism of
eyewitness confidence judgments.
S. Buratti (✉) · C. M. Allwood
Department of Psychology, University of Gothenburg,
Box 500, 40530 Göteborg, Sweden
e-mail: [email protected]
M. Johansson
Department of Psychology, Lund University, Lund, Sweden
Cogn Process
DOI 10.1007/s10339-013-0576-y
Only a few studies have investigated the issues of how
stable the participants’ confidence and realism of confi-
dence are over time (Jonsson and Allwood 2003; Meng-
elkamp and Bannert 2010; Stankov and Crawford 1996;
Thompson and Mason 1996). In general, although all four
studies looked at monotonic stability (explained below) by
means of using correlations, the results from these studies
should be compared and generalized with caution since
they differed in design and other aspects between them-
selves and from the present study. For example, the number
of test occasions varied between two (Stankov and Craw-
ford 1996; Thompson and Mason 1996, Experiments 1–2)
and three (Jonsson and Allwood 2003; Mengelkamp and
Bannert 2010, within the same session), and the time delay
between test occasions was 1 or 2 weeks, except in
Mengelkamp and Bannert (2010) where it was 10–20 min.
Moreover, all studies used alternative forms of the same
tests, except Mengelkamp and Bannert (2010) who used
partly the same questions and partly different questions
over the three test occasions. The majority of studies used
adult samples, usually undergraduates, with the exception
of Thompson and Mason, who in Experiment 2 used older
adults. Finally, Thompson and Mason (1996) used a
qualitative scale in unit steps from 0 (guessing) to 3
(Experiment 1) or 4 (Experiment 2) labelled certain, and
the other three studies used a numerical confidence scale
expressed in percent. The stability of confidence, as such,
over test occasions was not analyzed by Stankov and
Crawford (1996). However, the other three studies indicate
some stability of confidence over time. For example, the
Pearson correlations between the three test occasions in
Jonsson and Allwood (2003) varied between .71 and .87 for
the two types of material used. Kelemen et al. (2000)
studied the stability of other types of metacognitive
judgments.
Research on confidence stability has generally not
considered that stability is a multifaceted concept. Many
kinds of stability exist, and how to define stability is
debated (Asendorpf 1989; Duncan et al. 2006). For some
researchers, the term ‘‘stability’’ may be understood as no
change at all. Tisak and Meredith (1990) called this type of
stability strict stability, which means no change between
time points for an individual. Another type of stability is
monotonic stability, which implies a linear change with
individuals maintaining rank order within the group. Strict
stability can be considered a special case of monotonic
stability. The present study considered these two types of
stability; to differentiate between them, we used multilevel
modelling (MLM) analysis.
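The distinction between the two types of stability can be illustrated with a minimal descriptive sketch. It assumes error-free scores and checks only whether each person's score is unchanged (strict) or whether the rank order of persons is preserved across occasions (the rank-order part of monotonic stability). The function names and toy data are hypothetical, and the sketch is not the MLM analysis used in the study.

```python
def ranks(values):
    """Return the rank (0 = lowest) of each value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def classify_stability(scores, tol=1e-9):
    """Classify toy data where scores[i][t] is person i's score at occasion t.

    'strict'    : no person changes between occasions
    'monotonic' : scores change, but the rank order of persons is preserved
    'unstable'  : the rank order of persons changes between occasions
    """
    if all(max(person) - min(person) <= tol for person in scores):
        return "strict"
    occasions = list(zip(*scores))  # regroup the scores by occasion
    first = ranks(occasions[0])
    if all(ranks(occasion) == first for occasion in occasions[1:]):
        return "monotonic"
    return "unstable"


# Hypothetical confidence levels for two persons on three occasions
print(classify_stability([[0.70, 0.70, 0.70], [0.80, 0.80, 0.80]]))  # strict
print(classify_stability([[0.70, 0.75, 0.80], [0.80, 0.85, 0.90]]))  # monotonic
print(classify_stability([[0.70, 0.90, 0.60], [0.80, 0.70, 0.90]]))  # unstable
```

With real data, measurement error makes such exact checks unsuitable, which is one reason a model-based approach is needed.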
None of the previous studies investigating the realism of
eyewitness confidence over time used MLM to assess the
intra-individual stability of the realism of eyewitness con-
fidence judgments. In the present study, we investigated the
stability of the realism of eyewitnesses’ confidence in their
episodic memory over time using MLM to assess intra-
individual stability. We also investigated the differences in
the degree of stability between three different age groups,
including children and adults.
Measures of realism in metamemory research
Many different measures exist for assessing the realism of
confidence. Different studies use different measures, which
focus on the different aspects of realism of confidence. The
measures are often divided into two categories: absolute
and relative measures (Bornstein and Zickafoose 1999;
Lichtenstein and Fischhoff 1977; Nelson 1996; Schraw
2009). Yates (1990, 1994) has provided a general overview
of measures of the accuracy of confidence judgments.
Absolute measures are measures that assess the rela-
tionship between confidence and the proportion of correct
answers. One such measure is bias (also known as over/
underconfidence), which provides information about pos-
sible overconfidence or underconfidence. Another absolute
measure is calibration, which is the mean squared devia-
tion of confidence from the percent of correct answers in
each confidence class (0, 10, 20 %, etc.), where each
confidence class is weighted by the number of items in that
class (Lichtenstein et al. 1982). Unlike bias, calibration
does not provide information about the deviation of eye-
witness confidence from perfect realism in terms of abso-
lute values. However, because calibration performs a
squaring operation for each confidence class, the deviation
from perfect realism is penalized for each class, and thus,
overconfidence at one end of the confidence scale cannot
be cancelled out by underconfidence at the other end of the
confidence scale, as is the case for bias.
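The difference between the two absolute measures can be made concrete with a small sketch. It assumes that confidence is coded as a proportion (0.5 to 1.0) and correctness as 1 or 0; the function names and the toy data are hypothetical. In the example, underconfidence in the lowest class and overconfidence in the highest class cancel out in bias but not in calibration.

```python
from collections import defaultdict


def bias(confidences, correct):
    """Mean confidence minus proportion correct (positive = overconfidence)."""
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


def calibration(confidences, correct):
    """Weighted mean squared deviation between confidence and proportion
    correct, computed within each confidence class (0 = perfect)."""
    classes = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        classes[conf].append(ok)  # group the answers by confidence class
    n = len(confidences)
    total = 0.0
    for conf, outcomes in classes.items():
        n_t = len(outcomes)          # number of items in the class
        c_t = sum(outcomes) / n_t    # proportion correct in the class
        total += n_t * (conf - c_t) ** 2
    return total / n


# Toy data: two "guessing" answers that are both correct (underconfidence)
# and two "certain" answers of which only one is correct (overconfidence).
confidences = [0.5, 0.5, 1.0, 1.0]
correct = [1, 1, 0, 1]

print(bias(confidences, correct))         # 0.0
print(calibration(confidences, correct))  # 0.25
```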
Relative measures assess how correct and incorrect
answers are discriminated by the use of different confi-
dence levels. One such measure is resolution, which does
not provide any information about the direction of the
discrimination (Lichtenstein et al. 1982). Not knowing the
direction means that the same degree of resolution may
mean that high-confidence judgments are assigned to cor-
rect answers and low-confidence judgments to incorrect
answers, or that low-confidence judgments are assigned to
correct answers and high-confidence judgments to incorrect
answers. Furthermore, the difference between the average
confidence for correct answers and the average confidence
for incorrect answers is not provided by the resolution
measure; it only states the extent to which the person can
sort correct and incorrect answers into two separate groups
using their confidence judgments. Another relative measure
originating from signal detection theory is the distance-based
metric d_a (see Benjamin and Diaz 2008). Yet another
relative measure is the Goodman–Kruskal gamma
correlation, which is the proportion of concordance
between X and Y item pairs (see Nelson 1984). A final
relative measure is the Pearson correlation between confi-
dence and the correctness of reported memories (also
called the confidence/accuracy (C/A) correlation). Like the
gamma correlation, this measure shows the direction of the
discrimination but not the absolute size of the separation
between the mean confidence for correct and incorrect
answers.
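As an illustration of two of these relative measures, the following sketch computes the Goodman–Kruskal gamma using the standard formula (C − D)/(C + D), where C and D are the numbers of concordant and discordant item pairs, together with the Pearson (point-biserial) C/A correlation. It assumes that confidence is coded as a proportion and correctness as 1 or 0; the function names and toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def goodman_kruskal_gamma(confidences, correct):
    """Gamma = (C - D) / (C + D); pairs tied on either variable are ignored."""
    concordant = discordant = 0
    for (c1, a1), (c2, a2) in combinations(zip(confidences, correct), 2):
        product = (c1 - c2) * (a1 - a2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        return None  # undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)


def ca_correlation(confidences, correct):
    """Pearson correlation between confidence and correctness (0/1)."""
    n = len(confidences)
    mean_c = sum(confidences) / n
    mean_a = sum(correct) / n
    cov = sum((c - mean_c) * (a - mean_a) for c, a in zip(confidences, correct))
    var_c = sum((c - mean_c) ** 2 for c in confidences)
    var_a = sum((a - mean_a) ** 2 for a in correct)
    if var_c == 0 or var_a == 0:
        return None  # undefined without variation in both variables
    return cov / sqrt(var_c * var_a)


# Toy data: higher confidence is always given to the correct answers
confidences = [0.9, 0.8, 0.6, 0.5]
correct = [1, 1, 0, 0]
print(goodman_kruskal_gamma(confidences, correct))  # 1.0
print(round(ca_correlation(confidences, correct), 3))
```

Both functions return a signed value, so a witness who systematically gives higher confidence to incorrect answers obtains a negative result.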
Finally, slope provides a measure of the difference
between the average confidence for correct and incorrect
answers. This measure includes both a relative component
and an absolute component (i.e., information about the
eyewitness’ ability to live up to the normative goal of
maximal separation of the level of the confidence judg-
ments for correct and incorrect answers) (Yates 1990).
Thus, slope may be argued to belong to a third category of
realism measures: mixed measures. However, slope builds
on the eyewitness’ ability to discriminate between correct
and incorrect answers by using their confidence judgments,
and also gives information about the extent and direction of
this separation. For this reason, we will consider slope as a
relative measure. For example, a high positive value for
slope indicates that the level of the eyewitnesses’ confi-
dence judgment is informative about the correctness of
their memory. In contrast, if all confidence judgments are,
for example, at the 75 % confidence level and the witness
has 75 % of the answers correct, the witness shows no bias,
but also no slope. Here, the confidence judgments would
not help personnel in the justice system to discriminate
between correct and incorrect answers. These examples
also show that different measures of realism are useful
because they provide information about different aspects.
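The 75 % example above can be worked through in a short sketch. It assumes 20 questions, confidence coded as a proportion, and correctness coded as 1 or 0; the function names and the second, contrasting witness are hypothetical.

```python
def bias(confidences, correct):
    """Mean confidence minus proportion correct."""
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


def slope(confidences, correct):
    """Mean confidence for correct answers minus mean confidence for
    incorrect answers; undefined if either group is empty."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        return None
    return sum(right) / len(right) - sum(wrong) / len(wrong)


correct = [1] * 15 + [0] * 5  # 75 % of 20 answers are correct

# Witness A: every answer is given 75 % confidence
flat = [0.75] * 20
print(bias(flat, correct), slope(flat, correct))  # 0.0 0.0

# Witness B (hypothetical): 90 % for correct and 60 % for incorrect answers
informative = [0.9] * 15 + [0.6] * 5
print(round(slope(informative, correct), 2))  # 0.3
```

Witness A shows no bias but also no slope, whereas the confidence judgments of Witness B separate correct from incorrect answers.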
The value of various measures of confidence realism has
been debated in eyewitness research. Traditionally, the
Pearson correlation has been used in this research, espe-
cially in the research on lineups, but for nearly 20 years,
many researchers have argued that absolute measures, such
as bias, are more appropriate for eyewitness research
(Brewer 2006; Juslin et al. 1996; Wells et al. 2002, 2006;
see also Allwood 2010). Juslin et al. (1996) and Brewer
(2006) noted that perfect discrimination may demand that
the eyewitness has control of the conditions, such as the
degree of visibility, that the witness is commonly not able
to control. Thus, to ask for perfect discrimination is to ask
for the impossible. In contrast, calibration gives the witness
a chance to weigh in, for example, that the visibility,
duration, or angle of observation were poor when the ori-
ginal event was observed. Although these authors primarily
considered line-up situations, the same argument applies to
event memory in general.
Juslin et al. (1996) noted that, even in cases where
perfect calibration is present, the more the confidence
judgments are located at the ends of the scale, the better is
the correlation. Wells et al. (2006) noted that a conse-
quence is that higher C/A correlations are achieved when
different viewing conditions or different types of witnesses
are investigated compared to when more similar conditions
or types of witnesses are used. Finally, measures, such as
the C/A correlation, that only provide information about
the amount of variance explained, in addition to the
direction of the association between confidence and cor-
rectness, may be less helpful to personnel in the justice
system than measures that give information about absolute
levels of realism, such as bias, because the latter infor-
mation may be easier to understand.
In brief, different measures provide information on
different aspects of confidence realism. Because we wanted
to analyze stability for both absolute and relative accuracy,
we included bias, calibration, and slope as indicators of
confidence realism. Bias provides an absolute measure that
indicates the direction of realism (over/underconfidence).
The reason for using the calibration measure is that, in
contrast to bias, calibration calculates the squared differ-
ence within each confidence class and thus, for example,
underconfidence at one end of the confidence scale cannot
cancel out overconfidence at the other end of the scale. We
used slope because of its mentioned advantages and
because it is intuitively easy to understand.
Research on the stability and realism of confidence
judgments
To investigate confidence and confidence realism stability,
the processes on which metacognitive judgments are based
need to be considered. Metacognitive judgments are based
on different cues that stem from all information that is
activated when the judgment is made (Koriat 1993, 1994).
Such cues can be purely information-based, in that the cues
concern a judgment of one’s own knowledge or compe-
tence within a certain area (i.e., the theory we have
regarding how competent we are at, for example, answer-
ing general knowledge questions, Koriat et al. 2008). The
cues can also be experience-based and concern structural
information, such as processing fluency. For example,
answers that come easily to mind are experienced as having
high processing fluency and may be given higher confi-
dence judgments (Kelley and Lindsay 1993; Koriat 1993).
Research has shown that the levels of confidence for
different types of general knowledge tasks are highly cor-
related; thus, a confidence trait appears to exist (Kleitman
2008). Bornstein and Zickafoose (1999) found that peo-
ple’s confidence levels for a general knowledge task
moderately correlated with their confidence levels for an
eyewitness task. These results are in line with Koriat
et al.’s (2008) idea of information-based cues, and
similarly, a general preference for high confidence might
also give rise to an information-based cue. People may
have a feeling (or notion) of competence when it comes to
recalling information and events (information-based cues),
and this feeling may influence their confidence judgments.
Consequently, basing confidence on information-based
cues may, in turn, lead to some stability in confidence
judgments and in the absolute measures of confidence
realism. Finally, the stability of bias and calibration may be
due, to some extent, to the stability of the participants’
confidence judgments as these measures include confi-
dence. As previous research has indicated stability in
confidence, this statistical effect is likely to contribute to
the stability of absolute measures of realism.
Several studies have reported the stability of absolute
measures. Stankov and Crawford (1996) found evidence for
bias stability and somewhat weaker evidence for stability in
calibration, when participants performed Raven’s Progressive
Matrices, a perceptual test, two vocabulary tests, and
two digit span tests. Similar results were found for
calibration and bias for word knowledge and logical spatial
ability (Jonsson and Allwood 2003), and for learning about
operant conditioning (Mengelkamp and Bannert 2010).
Previous research has mostly covered semantic memory
retrieval, but similar processes may be valid for episodic
memory retrieval. For example, a person may believe that
he or she is very good at recalling events and base judg-
ments on this assumption, resulting in high-confidence
judgments in eyewitness tasks. Conversely, a person may
highly doubt his or her memory for events and consistently
express low confidence in eyewitness tasks. Thus, we
expect some stability in absolute measurements of realism
also for the event memory task used in this research.
However, there are also some other differences between
the situations investigated in the present study and in the
previous research on semantic memory. For example, as
described above, the previous studies mostly asked about
similar but new material on each occasion, whereas our
participants were asked different questions on each occasion
but these questions all concerned the same previous short
event. In spite of these differences, we speculated that the
type of information-based cues described above would also be
used in the context we investigated.
In addition to information-based cues, as noted above,
confidence judgments can also be influenced by structural
information or so-called experience-based cues (Koriat
et al. 2008). Such experience-based cues may differ for
judgments made at different time points after the original
event. Leonesio and Nelson (1990) argued that structural
information could differ at different points in time when
used in metacognitive judgments. Since relative measures
attend to how well confidence discriminates between cor-
rect and incorrect answers, they are likely to be affected by
experience-based cues. The changes in structural infor-
mation over time are thus likely to contribute to instability
in slope and relative measures in general. For example, a
good slope value is dependent on the confidence level for
correct answers and for incorrect answers, and these levels
may, to a large extent, be derived by using fluency and
other experience-based cues that may be influenced by
when in time the confidence judgment is made. Thus, even
though relative measures to some extent are influenced by
information-based cues, the influence of experience-based
cues on relative measures may well be larger since these
measures to a large extent can be expected to be influenced
by cues relating to the correctness or incorrectness of the
specific answers. Thus, the heavy influence by experience-
based cues can be expected to lead to instability in these
measures.
In accordance with this suggestion, several studies have
shown that relative measures evidence no or little stability
and do not exhibit the same degree of stability as absolute
measures. Stankov and Crawford (1996) reported low sta-
bility for the relative measures of resolution and slope,
slope showing somewhat stronger stability than resolution.
However, Jonsson and Allwood (2003) did report some
stability for resolution in their study (they did not investi-
gate slope). Thompson and Mason (1996) reported a lack
of stability for Goodman–Kruskal gamma correlation (used
as a measure of confidence realism) in a general knowledge
task, a word recognition test, and a face-recognition task.
The same lack of stability for Goodman–Kruskal gamma
correlation, Pearson’s r, and d_a was shown by Mengelkamp
and Bannert (2010). In brief, various measures of
relative accuracy have been used in previous studies and
the results provide evidence of fairly low stability, or
instability, over time for these measures.
In brief, although the previous research used somewhat
different designs and differed in other aspects compared
with the present study, we expected stability for absolute
measures and a lack of stability for slope. One reason is
that slope is influenced, to a greater extent than the absolute
measures, by the level of confidence for correct and
incorrect answers. Other than the face-recognition task
used by Thompson and Mason (1996), we know of no
earlier study investigating the intra-individual stability of
realism of confidence for an eyewitness situation.
A better understanding of the degree of stability and
type of stability (strict or monotonic) of absolute and rel-
ative measures of realism in eyewitness confidence judg-
ments can help staff in the justice system better understand
the conditions for diagnosing the degree of realism asso-
ciated with the confidence judgments of an eyewitness. For
example, it is of forensic relevance whether confidence
judgments given at various points in time after the crime
event can be expected to be strictly stable, monotonically
stable, or not stable at all. Our results pertain especially to
situations where the interviewee is asked questions that
differ between interview occasions, but where the ques-
tions all relate to the same short-duration event. Further-
more, given that the research reviewed above has found
some measures of confidence realism to be more stable
than other measures, the degree of realism of a specific
witness might be diagnosed by giving the witness a small
test using these measures of realism. Although not the
primary aim of the present study, our results also have
implications for the issue of the existence of a confidence
trait suggested by Kleitman and Stankov (2007). For
example, a finding of at least monotonic stability for con-
fidence might, depending on the other results, be inter-
preted as supportive of such a trait.
Stability in different age groups
Three age groups (8- to 9-year-olds, 10- to 11-year-olds, and
adults) were chosen in order to compare stability in relation
to age. Research shows that metacognitive abilities
improve with age (Kuhn and Dean 2004; Schneider 2008;
Schneider and Lockl 2002). Children younger than 8 years
old were not included in the study because the extent to
which they can understand the confidence judgment task as
presented in this research is not completely agreed upon by
the researchers (Allwood et al. 2006, 2008; Howie and
Roebers 2007; Schlottmann and Anderson 1994). The 10-
to 11-year-old age group was included in order to study the
rate at which metacognitive stability changes for children.
Comparing different age groups is of interest because
different types of cues likely form the basis of confidence
judgments. Younger children may use more experience-
based cues to make confidence judgments, as the younger a
person is, the less likely they are to have formed a theory
regarding their performance in similar tasks (information-
based cues). On the other hand, adults have had plenty of
time to form a theory regarding their performance in a
number of different tasks, which in turn would lead them to
make more information-based metacognitive judgments.
As children’s confidence judgments are more likely to be
based on experience-based cues, younger children may
be expected to show more instability in both confidence
and absolute accuracy measures, as well as relative accu-
racy measures, such as slope. We know of no previous
study that has investigated differences in the degree of
stability in confidence realism between age groups.
Multilevel modelling
Recently, a number of new statistical techniques have
emerged that can be used to investigate stability and
change over time. MLM is one such technique, allowing
researchers to investigate whether the time variable is
associated with the level of confidence realism (Rauden-
bush 2001). MLM also allows the researchers to make
inferences about the within-person variance (i.e., the intra-
individual variation for a person over time) as well as the
between-person variance (i.e., differences between people).
For example, age group can be used as a predictor for
investigating differences in change patterns in the mea-
sured variable across time.
Kwok et al. (2008) reviewed several reasons why MLM
techniques are preferred over repeated measures ANOVA
when using a repeated measures design. For example,
MLM allows the estimation of individual-level trends over
time. In contrast, repeated measures ANOVA estimates an
average trend for all participants and treats the individual
variance as unexplained error. Thus, individual variance is
not a problem for MLM, but it is considered noise in a
repeated measures ANOVA. When applying a repeated
measures design with data collected at three time points,
MLM provides a possibility of estimating the individual
differences in rates of change.
Aims
The purpose of the present study was to investigate the
level of stability across time in the proportion of correct
memory reports, confidence, and realism as measured by
bias, calibration, and slope for memory reports of a wit-
nessed event. We looked for evidence of intra-individual
stability (strict or monotonic) for any of the dependent
measures across time. If a measure shows strict stability,
a time-change model cannot be fitted, that is, there should
be no systematic change predicted by time. Also, for a
measure that shows strict stability, there should not exist
any non-systematic change due to other known and
unknown variables. If there exists monotonic stability,
then there could be a linear fixed effect of time, that is,
either a linear increase or a decrease over time for all
individuals. However, there should not be a random effect
of time indicating that individuals’ rank order is being
violated. In addition, we investigated whether a difference
exists in the degree of stability between different age
groups.
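These criteria can be expressed in terms of a generic two-level linear growth model. The following is a sketch only: the symbols are assumptions introduced here for illustration, and the specification is not necessarily the exact model fitted in the study. Y_{ti} denotes the score of person i at occasion t on one of the dependent measures, and week_{ti} is the time variable.

```latex
\begin{align}
  % Level 1 (within person): change over occasions
  Y_{ti}   &= \pi_{0i} + \pi_{1i}\,\mathrm{week}_{ti} + e_{ti} \\
  % Level 2 (between persons): individual intercepts and slopes of time
  \pi_{0i} &= \beta_{00} + r_{0i} \\
  \pi_{1i} &= \beta_{10} + r_{1i}
\end{align}
% Strict stability:    \beta_{10} = 0 and \mathrm{Var}(r_{1i}) = 0
% Monotonic stability: \beta_{10} may differ from 0, but \mathrm{Var}(r_{1i}) = 0
```

Here \beta_{10} is the fixed effect of time (the average linear change) and \mathrm{Var}(r_{1i}) is the random effect of time (individual differences in the rate of change). A between-person predictor such as age group could be added to the Level 2 equations to compare change patterns between groups.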
Based on the theory regarding information-based and
experience-based cues by Koriat et al. (2008) and the
results from previous studies, we hypothesized that
1. The confidence level will show monotonic stability
over time. We did not expect strict stability, since
confidence level is dependent on the correctness of the
answers, which might change over time.
2. The absolute measures of realism (i.e., bias and
calibration) will show strict stability over time. Since
we did not find any convincing reason why the
absolute confidence realism measures would differ
between the occasions, we did not just expect mono-
tonic stability for these measures.
3. Slope, a relative measure of realism, will not be stable
over time.
4. The adults will show a higher degree of stability in
confidence and the two absolute metamemory mea-
sures than the two child groups. Two general reasons
for expecting higher stability for the adults are that
children may be more driven by experience-based cues
than adults and that children may have less stable
cognitive systems (see, e.g., Knutsson et al. 2011).
Method
Participants
A total of 32 children aged 8–9 years (younger children,
Md_age = 9), 31 children aged 10–11 years (older children,
Md_age = 11), and 30 adults (Md_age = 23, ranging from 19
to 31 years of age) participated in the study. Four partici-
pants (one older child and three adults) did not complete all
required parts of the study. Three of the participants
completed only two of the three sets of questions, and one
participant completed only one of the three sets of questions.
The incomplete data were analyzed using Little’s MCAR test
and found to be missing completely at random. Therefore, the
missing data for these four participants were imputed using
the expectation–maximization (EM) algorithm.
Design
The three groups each participated on four occasions,
8 days apart, three of which were retrieval occasions.
Three sets of 20 questions were counterbalanced across
participants and retrieval occasions. No ordering effect was
found on any of the measures. A pilot study found that the
degree of difficulty was equal across the three sets. Also, a
repeated measures ANOVA of the data in this study found
that there was no significant difference in difficulty
between the three sets of questions (F(2, 184) = .30,
p = .738).
Materials
Film clip
The participants watched a film clip (3 min and 4 s long, in
color) depicting a street crossing at a bus central in the
center of Lund, Sweden. The film clip showed a street
scene, including shops and other buildings, with people and
vehicles passing. Given that witnesses sometimes give
testimony simply about the presence of, for example,
people or cars in specific settings (locations at specific
times), this film was judged to be relevant to our research.
Questionnaire
Three sets of questions, each with 20 different two-alter-
native forced-choice questions about the content of the
film, were used. The questions about the film’s content
pertained to aspects such as actions, persons, and physical
objects. Different questions were used at each measure-
ment point in order to, as far as possible given the design of
the present study, avoid confounding the effect of stability
with repeated questions. For each question, one of the two
answer alternatives was always correct. An example of a
question is: ‘‘What did the boy with the skateboard have in
his hand: (a) a bottle or (b) a plastic bag?’’ For each
question, the participant rated his or her level of confidence
on a scale ranging from 50 % (‘‘guessing’’) to 100 % (‘‘I’m
absolutely sure that my answer is correct’’), with 10 %
increments rendering a total of six confidence classes.
Procedure
The children participating in the study were recruited at a
school located in the far south of Sweden. The children’s
parents received information about the study via a letter
and returned a response form stating whether or not they
gave their consent for their child to take part in the study.
The adult participants were students at Lund University,
Sweden. When recruited, the participants were told they
would watch a short film clip and then answer questions
about the content of the film on three occasions.
First, the participants watched the film clip. Eight days
after watching the film clip, the participants took part in the
first of the three paper-and-pencil tests. On each test
occasion, the participants completed one of the three sets of
20 questions and made confidence judgments. The partic-
ipants were tested in groups of 2–3 individuals and could
not see each other’s responses as they were seated back to
back.
On the first occasion, the experimenter used one very
easy and one very difficult practice question to inform the
participants about the confidence scale. As predicted, all
children made the appropriate confidence responses to both
the easy and the difficult question. In this context, we
informed the children that the 50 % confidence level cor-
responded to guessing. The children were also told that if
they were neither guessing nor absolutely certain, they
were free to choose an appropriate confidence among the
remaining four confidence levels (60, 70, 80, 90 %).
In previous research, this type of scale has been successfully
used with children as young as 8 years (Allwood
et al. 2008).
The adult participants were informed that the instruc-
tions and questions had been designed to suit young chil-
dren who also participated in the study. When informing
the adult participants about how to answer the questions
and to make confidence judgments, we used practice
questions more suited to adults. Each participant was asked
not to discuss the film or the questions about its content
with other persons, regardless of whether those persons
participated in the study.
The participating children’s school received
approximately 460 USD, and each of the adult participants
received approximately 15 USD for taking part in the
study.
Results
Preliminary analyses found no significant difference
between the sets of questions for any of the dependent
measures (proportion correct, confidence, bias, calibration,
and slope). Reliability analyses were carried out for the raw
scores: proportion correct and confidence. Because the
inter-item covariance for proportion correct was negative, the
assumptions for reliability testing were not met and no alpha values
can be reported for the three sets of questions. Several factors may
have contributed to this; for example, the binary
nature of the data may have restricted the spread. For
confidence, Cronbach’s alpha was .93, .96, and .96 for the
three sets of questions.
Independent variables and scoring procedure
for dependent measures
The independent time variable, week, was centered at the
first measurement point (i.e., the first retrieval occasion),
which was used as the reference point in the analysis (week
1). Because of the short delay between the measurement
occasions, the ages of the participants did not change
during the study and therefore age group was treated as a
constant between-person (Level 2) variable. The between-
person (Level 2) variable, sex, was also evaluated.
In addition to the proportion of correct answers and
confidence, we used several metacognitive outcome mea-
sures following the recommendations of Schraw (2009).
More specifically, three such measures were assessed: bias,
calibration, and slope. Bias is calculated by subtracting a
person’s proportion of correct answers from their average
confidence judgment. A value near zero indicates that the
person is neither over- nor underconfident. Calibration is
calculated using the following formula:
\[ \text{Calibration} = \frac{1}{n} \sum_{t=1}^{T} n_t \left( r_{tm} - c_t \right)^2 \]
Here, $n$ is the total number of questions judged, $T$ is the
number of confidence classes used, $n_t$ is the number of
confidence judgments within confidence class $r_t$, $r_{tm}$ is the
mean confidence level in confidence class $r_t$, and $c_t$ is the
percent of correct answers within confidence class $r_t$. For
each confidence class, the percent of correct answers within
that class is subtracted from the mean level of confidence
within that class. This difference between the mean
confidence and percent of correct answers is squared and
multiplied by the number of times the confidence class was
used by the participant. The resulting products for each
confidence class are summed, and the sum is divided by the
total number of questions (e.g., Lichtenstein et al. 1982).
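The bias and calibration computations described above can be illustrated with a minimal sketch. It assumes that each confidence level on the scale forms its own confidence class, so that the mean confidence within a class equals the level itself, and that confidence and accuracy are both expressed as proportions. The function names and the example data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def bias(confidences, correct):
    """Mean confidence minus proportion correct (positive = overconfidence)."""
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


def calibration(confidences, correct):
    """Weighted mean squared gap between confidence and accuracy per class.

    Assumption: every distinct confidence level is treated as one class.
    """
    classes = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        classes[conf].append(ok)

    n = len(confidences)
    total = 0.0
    for conf, outcomes in classes.items():
        n_t = len(outcomes)              # judgments in this class
        c_t = sum(outcomes) / n_t        # proportion correct in this class
        total += n_t * (conf - c_t) ** 2
    return total / n


# Hypothetical answers from one participant: confidence and correctness (1/0)
confidences = [0.5, 0.5, 0.6, 0.6, 0.9, 0.9, 1.0, 1.0]
correct = [1, 0, 1, 0, 1, 1, 1, 0]

print(round(bias(confidences, correct), 4))
print(round(calibration(confidences, correct), 4))
```

A calibration value of zero would mean that, within every confidence class, the proportion of correct answers matched the stated confidence.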
The third measure was slope, which indicates how well
the participants differentiated between correct and incor-
rect answers. Slope is calculated by subtracting the average
confidence for incorrect items from the average confidence
for correct items. If a person has low confidence for
incorrect answers and high confidence for correct answers,
it would indicate good separation and, thus, a good slope.
In the present study, the confidence scale ranged from 50 to
100 %. Thus, a slope value of .5 indicates perfect separa-
tion, 0 indicates no separation at all, and -.5 indicates
faulty separation (high-confidence judgments for incorrect
items and low-confidence judgments for correct items).
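As an illustration of the slope definition, the following sketch computes the difference in mean confidence between correct and incorrect answers. The function name and the example data are hypothetical. Returning None when a participant has only correct or only incorrect answers is an assumption made here, since the text does not state how such cases were handled.

```python
def slope(confidences, correct):
    """Mean confidence for correct answers minus mean for incorrect answers."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        return None  # assumption: slope is undefined without both kinds of answers
    return sum(right) / len(right) - sum(wrong) / len(wrong)


# Perfect separation on a 50-100 % scale: 1.0 for correct, 0.5 for incorrect
print(slope([1.0, 1.0, 0.5, 0.5], [1, 1, 0, 0]))

# Hypothetical participant with weak separation
print(round(slope([0.5, 0.5, 0.6, 0.6, 0.9, 0.9, 1.0, 1.0],
                  [1, 0, 1, 0, 1, 1, 1, 0]), 4))
```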
Analyses
To test the first three hypotheses, data were modelled using
the mixed procedure in the SPSS software. An uncondi-
tional means model, in which time is not added as an
independent variable, was estimated and intraclass corre-
lation (ICC) calculated. The ICC is the proportion of
between-person variance of the total variance (both the
between- and the within-person variance) and is calculated
by taking the between-person variance and dividing it by
the sum of the between- and within-person variance. The
ICC is thus also interpreted as the within-person correlation
of the dependent variable across measurement points since
the higher the proportion of the between-person variance,
the lower the within-person variance between measurement
points (Quene and van den Bergh 2004). Thus, a high ICC
indicates high intra-individual stability. These model esti-
mations and ICC calculations were done for each of the
five dependent measures: proportion of correct memory
reports, confidence, bias, calibration, and slope. After the
unconditional model and the ICC calculations were made,
the time models were estimated using p-values to evaluate
the significance of fixed effects. The fit of the random
effects model was evaluated using the χ² test to determine
the significance of the difference between the models’ -2 restricted
log-likelihood (REML) values. To further investigate the
stability of the dependent measures, the individual trajectories
were assessed graphically, as a failure to fit a model
showing that time predicts change does not necessarily
mean that a dependent measure is stable. There may still
exist non-systematic changes that are not predicted by the
variable of time. Thus, a dependent measure can be
unstable due to two types of changes, namely systematic
(time predicts change with fixed or random effects) or non-
systematic (other known or unknown variables). To
investigate whether age group can explain differences in
patterns of change in the measured variable across time,
this predictor was added to all models as well. The pre-
dictor sex was added to all models but had no significant
effect and was therefore excluded from the final models.
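The ICC described above can be sketched as follows. The study estimated the variance components with the SPSS mixed procedure; the sketch below instead uses the classical one-way ANOVA estimator for balanced data, which is a swapped-in technique chosen so that the example needs no external libraries. For balanced designs such as three occasions per person, its estimates should closely agree with restricted maximum likelihood whenever the between-person estimate is positive. The function name and the data are hypothetical.

```python
def icc_balanced(scores):
    """ICC from a one-way random-effects ANOVA on balanced data.

    scores: one list per person, each holding that person's m repeated scores.
    Returns between-person variance / (between-person + within-person variance).
    """
    k = len(scores)          # number of persons
    m = len(scores[0])       # measurement occasions per person
    if any(len(person) != m for person in scores):
        raise ValueError("balanced data required")

    grand = sum(sum(person) for person in scores) / (k * m)
    means = [sum(person) / m for person in scores]

    ms_between = m * sum((mu - grand) ** 2 for mu in means) / (k - 1)
    ms_within = sum((x - mu) ** 2
                    for person, mu in zip(scores, means)
                    for x in person) / (k * (m - 1))

    var_between = max((ms_between - ms_within) / m, 0.0)  # truncate at zero
    total = var_between + ms_within
    return var_between / total if total > 0 else 0.0


# Persons differ from each other but are perfectly stable over occasions
print(icc_balanced([[0.6, 0.6, 0.6], [0.8, 0.8, 0.8]]))
# Persons have the same mean and only fluctuate within themselves
print(icc_balanced([[0.5, 0.7], [0.7, 0.5]]))
```

The two examples mark the extremes: an ICC near 1 when all variance lies between persons, and an ICC of 0 when all variance lies within persons.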
To test the fourth hypothesis regarding differences in the
degree of stability between the different age groups, a
range value was calculated for each individual by taking
his or her maximum score of the three measurement points
for a measure and subtracting the individual’s minimum
score for the same measure. One-way ANOVAs were
conducted for each dependent measure using the between-
groups variable of age group. ANOVAs were used since
differences in degree of stability between age groups can-
not be investigated using MLM.
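The range value used for the fourth hypothesis is simple to compute; a minimal sketch is given below. The function names and the scores are hypothetical, and the sketch covers only the per-person range and its group mean, not the subsequent ANOVA.

```python
def range_value(scores):
    """Maximum minus minimum of one person's scores across occasions."""
    return max(scores) - min(scores)


def mean_range(group):
    """Average range value across the persons in one age group."""
    ranges = [range_value(person) for person in group]
    return sum(ranges) / len(ranges)


# Hypothetical calibration scores for two persons on three occasions
group = [[0.08, 0.05, 0.11], [0.04, 0.04, 0.06]]
print(round(range_value(group[0]), 4))
print(round(mean_range(group), 4))
```

A smaller range value corresponds to a person whose scores varied less across the three occasions.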
Descriptives of the dependent measures are given in
Table 1. The adults were, on average, slightly undercon-
fident across all three measurement points. In contrast, the
older and younger children exhibited overconfidence across
the three measurement points. All three age groups showed
low separation in their confidence judgments as indicated
by the slope measure.
Assessing intra-individual stability
As the Goodman–Kruskal gamma correlation has been
used as a measure of stability across time in earlier studies
(e.g., Mengelkamp and Bannert 2010), it was calculated
between the different measurement points. The overall
Goodman–Kruskal gamma correlations between weeks
1–2, 2–3, and 1–3 for each dependent measure are provided
in Table 2. Corresponding Spearman correlations showed
similar patterns of results and are therefore not reported
here. Estimates and standard errors for the fitted
models for the proportion of correct answers, confidence,
bias, calibration, and slope are provided in Table 3.
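For readers unfamiliar with the Goodman–Kruskal gamma, a minimal sketch of its computation between two measurement points is given below. Gamma is the number of concordant pairs minus the number of discordant pairs, divided by their sum, with tied pairs ignored. The function name and the data are hypothetical, and the simple pairwise loop is written for clarity rather than speed.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Gamma = (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # pairs tied on x or y are ignored
    if concordant + discordant == 0:
        return None  # undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical mean confidence of four persons at week 1 and week 2
week1 = [0.60, 0.70, 0.70, 0.90]
week2 = [0.65, 0.60, 0.80, 0.85]
print(goodman_kruskal_gamma(week1, week2))
```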
Table 1 Mean and standard deviation (SD) values for adults, older
children (10–11 years), and younger children (8–9 years) on proportion
correct, confidence, bias, calibration, and slope across three
time points

                        Week 1          Week 2          Week 3
                        M      SD       M      SD       M      SD
Proportion correct      .589   .107     .589   .107     .578   .116
  Adults                .640   .089     .632   .088     .646   .094
  Older children        .595   .107     .567   .094     .562   .092
  Younger children      .537   .100     .569   .125     .531   .129
Confidence              .672   .111     .667   .128     .663   .125
  Adults                .620   .065     .598   .075     .598   .079
  Older children        .684   .100     .673   .104     .677   .113
  Younger children      .589   .107     .589   .107     .578   .116
Bias                    .080   .155     .076   .176     .082   .180
  Adults               -.022   .093    -.036   .106    -.050   .114
  Older children        .088   .141     .105   .141     .113   .159
  Younger children      .168   .161     .154   .205     .177   .179
Calibration             .078   .065     .075   .073     .077   .065
  Adults                .049   .043     .041   .024     .039   .028
  Older children        .079   .054     .079   .059     .081   .057
  Younger children      .104   .080     .104   .100     .110   .078
Slope                   .027   .064     .021   .061     .018   .052
  Adults                .046   .080     .049   .070     .045   .053
  Older children        .021   .054     .007   .045     .014   .044
  Younger children      .016   .055     .009   .058    -.004   .047
Table 2 Goodman–Kruskal gamma correlations between week 1–2,
week 2–3, and week 1–3 for adults, older children (10–11 years), and
younger children (8–9 years) for the different dependent variables

                        Week 1–2    Week 2–3    Week 1–3
Proportion correct       .123       -.027        .026
  Adults                -.283        .000       -.231
  Older children         .044       -.071        .048
  Younger children       .318*      -.198       -.186
Confidence               .620***     .668***     .540***
  Adults                 .220        .460***     .191
  Older children         .670***     .681***     .563***
  Younger children       .753***     .688***     .611***
Bias                     .382***     .378***     .352***
  Adults                -.010        .256*       .138
  Older children         .218        .300**      .303*
  Younger children       .549***     .279*       .172
Calibration              .387***     .296***     .244***
  Adults                 .219       -.242*      -.149
  Older children         .253*       .365*       .303*
  Younger children       .565***     .290*       .159
Slope                    .003        .108        .023
  Adults                -.151       -.071       -.127
  Older children        -.090        .096       -.068
  Younger children       .177        .079        .074

p-values indicate the significance for correlations between weeks
* p < .05, ** p < .01, *** p < .001
Proportion of correct answers
The ICC for the proportion of correct answers for all
memory reports indicated that the proportions correct
correlated poorly across the three different time points.
Accordingly, only 5 % of the variance was the result of
between-person differences in the scores for the proportion
of correct answers. Consequently, 95 % of the variance in
the scores for the proportion of correct memory reports was
due to within-person differences across the three mea-
surement points. No significant fixed linear or random
linear effects of time on the scores for the proportion of
correct memory reports were found, that is, we found no
systematic change in the proportion of correct memory
reports that could be accounted for by the time variable.
When age group was added to the unconditional model as a
predictor, SPSS failed to estimate the group values, prob-
ably because of the small ICC.
Although the failure to fit a model of linear change
should indicate stability, the low ICC indicates that the
intra-individual score for the proportion of correct memory
reports was not stable for the majority of participants. This
indication of poor stability was also supported by graphical
examinations of the trajectories which showed that a
majority of the participants differed considerably between
the three measurement points. In total, the mean variation
in proportion correct answers was 17.5 %.
Confidence
The between-person variance of confidence accounted for
78 % of the total variance, indicating that the participants’
level of confidence was highly correlated between the three
different time points. When estimating the model, no sig-
nificant fixed linear effect of time on confidence scores was
found; the time variable did not account for any systematic
change in confidence on the group level. However, a significant
random linear effect of time was found (REML χ²
difference (2) = 11.515, p < .01). This indicates that there
were significant differences between individuals in the
patterns of linear change observed. To further investigate
this finding, confidence intervals (CI) were calculated for
the random effects. For 95 % of the individuals, the tra-
jectory was predicted to be between -6.72 % units and
5.92 % units, indicating that the level of confidence
declined over time for some individuals and increased for
others. Thus, participants need to be described by different
trajectories. When graphically examining the trajectories, it
was also evident that there were participants that did not
follow these patterns of increase and decrease but rather
fluctuated. The mean variation in confidence between the
measurement points was 8.6 %. When age group was
added as a predictor, a significant main effect of age group
(F(2, 90.01) = 8.48, p < .001) indicated a difference in
confidence scores between all three age groups at the first
Table 3 Longitudinal models for the outcome measures of proportion correct, confidence, bias, calibration and slope for adults, older children
(10–11 years), and younger children (8–9 years)
Proportion correct Confidence Bias Calibration Slope
Est. SE Est. SE Est. SE Est. SE Est. SE
Fixed effects
Intercept (γ00) .639*** .011 .611*** .019 -.036 .021 .043*** .009 .046 .006
Week (γ10) -.004 .005
Age group (γ01)
Adultsa b b 0 0 0 0 0 0 b b
Older children b b .071** .026 .138*** .030 .037*** .012 b b
Young children b b .106*** .026 .200*** .030 .063*** .012 b b
Random effects
Residual var. eti b .002*** .000 .013*** .002*** b b
Intercept var. U0i b .011*** .002 .009*** .001*** b b
Week variance U1i .001** .000
Int.-week covar. (U0i,U1i) -.001 .001
REML deviance - 458.232 -583.040 -295.962 -767.005
Week is centered at week 1. Est. = estimate, SE = standard error, Residual var. = the within-person variance, Intercept var. = the intercept
variance, Week var. = the week variance, and Int.-week covar. = the covariance between intercept and week. REML deviance = -2 restricted
log likelihood
a Adults is the reference point of age groups
b SPSS was not able to calculate the estimate
* p < .05, ** p < .01, *** p < .001
measurement point. The younger children scored 10.6 %
units (SE = 2.6 %) higher on the confidence measure than
the adults, and the older children scored 7.0 % units
(SE = 2.6 %) higher than the adults on the same measure.
This result indicates that the different age groups need
different intercepts. The final model was:
Level 1: \[ \text{Confidence}_{ti} = \beta_{0i} + \beta_{1i}(\text{Week}_{ti}) + e_{ti} \]
Level 2: \[ \beta_{0i} = \gamma_{00} + \gamma_{01}(\text{Age group}) + U_{0i} \]
\[ \beta_{1i} = \gamma_{10} + U_{1i} \]
In this model, Level 1 describes the within-person
effects and Level 2 the between-person effects. As
seen in the model, confidence is a function of the expected
value of an individual $i$ on the intercept $\beta_{0i}$, the trajectory
$\beta_{1i}$, which refers to an estimated value of linear
change for an individual as a function of $\text{Week}_{ti}$, and $e_{ti}$,
which is the within-person residual (its variance is the
within-person variance). Furthermore, the
expected value for an individual $i$ ($\beta_{0i}$) is a function of the
fixed intercept ($\gamma_{00}$), which is the mean confidence level at
the first measurement point, the age group ($\gamma_{01}$), and the
between-person deviation in the intercept ($U_{0i}$). In addition,
$\beta_{1i}$ is a function of the mean (nonsignificant) linear change
in time, that is, the fixed effect of Week ($\gamma_{10}$), and the
variation in change between individuals across weeks, the
Week variance ($U_{1i}$). In brief, the results indicate
instability in the individual participants’ confidence over
time.
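The interval for the individual trajectories and the p-value for the random effect of time reported above can be approximately reproduced from the rounded estimates in Table 3. The sketch below assumes that the interval was formed as the fixed week effect plus or minus two standard deviations of the random week effect; this multiplier is an inference made here because it reproduces the reported bounds, not a detail stated in the text. The function names are hypothetical. The upper-tail probability uses the closed form for a chi-square distribution with 2 degrees of freedom.

```python
import math


def random_slope_interval(fixed_slope, slope_variance, multiplier=2.0):
    """Approximate range covering about 95 % of the individual slopes.

    Assumption: bounds are fixed_slope +/- multiplier * sqrt(slope_variance).
    """
    half_width = multiplier * math.sqrt(slope_variance)
    return fixed_slope - half_width, fixed_slope + half_width


def chi2_upper_tail_df2(x):
    """Upper-tail probability of a chi-square variable with 2 df: exp(-x / 2)."""
    return math.exp(-x / 2.0)


# Rounded Table 3 estimates for confidence: week effect -.004, week variance .001
low, high = random_slope_interval(-0.004, 0.001)
print(round(low * 100, 2), round(high * 100, 2))   # -6.72 5.92 (percentage units)

# Deviance difference of 11.515 on 2 df
print(round(chi2_upper_tail_df2(11.515), 4))
```

Under these assumptions the bounds match the reported -6.72 and 5.92 percentage units, and the tail probability falls below .01, in line with the reported significance level.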
Bias
The between-person variance of bias accounted for 54 %
of the total variance. Consequently, the bias measure
moderately correlated across the three measurement points.
While estimating the model, no significant fixed linear or
random linear effect of time on bias measure was found.
When the predictor age was added to the unconditional
model, a significant effect was found (F(2, 90) = 25.00,
p < .001). Younger children (.20, SE = .03) and older
children (.14, SE = .03) were significantly more biased at
the beginning of the study (at the first measurement point)
than adults.
The failure to estimate a time-change model indicates
that the intra-individual stability of the bias measure is
high. This indication was supported when each individual’s
trajectory was examined graphically, which showed that
the participants did not differ much between the three
measurement points. The mean variation of the bias score
was .19 (it should be noted that bias has a larger scale span
with possible values from -1 to 1, unlike the calibration
measure). Therefore, we found evidence that the level of
bias is strictly stable (see also Tisak and Meredith 1990),
that is, it does not differ significantly for individuals over
time.
Calibration
Intraclass correlation was calculated and the between-
person variance of calibration accounted for 45 % of the
total variance. Consequently, calibration moderately cor-
related between the three different time points. No sig-
nificant fixed linear effect or random linear effect of time
on calibration was found. When the predictor age group
was added to the unconditional model, a significant effect
was found (F(2, 90) = 13.48, p < .001). Younger children
(.06, SE = .01) and older children (.04, SE = .01) were
significantly less well calibrated than adults at the beginning
of the study (at the first measurement point). The
failure to fit a time-change model indicates
intra-individual stability in the calibration measure over time.
This finding was also supported by a graphical examina-
tion of each individual’s trajectory, which indicated that a
majority of the participants did not differ much between
the three measurement points. The mean difference in
calibration was .07. Therefore, we found evidence that the
level of calibration is strictly stable, that is, it does not
differ significantly for individuals over time.
Slope
The ICC was calculated and the between-person variance
of the slope measure accounted for 6 % of the total vari-
ance. Consequently, the within-person correlation of slope
for the different measurement points was low. When esti-
mating the model, no fixed linear or random linear effect of
time on the slope measure was found. When age group was
added to the unconditional model as a separate predictor,
SPSS failed to estimate the group values, probably because
of the small ICC. Although fitting a time-change model
failed, the low ICC indicated that the slope measure was
not very stable over time. This finding was supported when
the trajectories were examined graphically, since a major-
ity of participants differed between the three measurement
points. However, the mean variation was only .09, which
can be the result of different trajectories cancelling out
each other. Thus, the findings regarding the slope measure’s
stability are mixed. Although the failure to fit a time model
can be an indication of strict stability, the low ICC indicates
that there exists unsystematic change, which also
would lead to a failure to fit a time-change model. This
unsystematic change, as indicated by the low ICC, suggests
that the slope measure tends to be
less stable than the two absolute measures.
Differences in degree of stability between age groups
In order to test hypothesis 4, the average range value for
each age group was calculated as described above. These
values are provided in Table 4. A lower value indicates
lesser range and higher stability across time points. One-
way ANOVAs showed no differences in the range values
between the age groups for the proportion of correct
answers, confidence, and bias. A significant difference in
stability between the age groups was found for calibration
(F(2, 57.802) = 6.164, p = .004, η² = .146). Because the
assumption of homogeneity of variance was not met for calibration, the
Welch formula was used in the ANOVA above and a
Games–Howell post hoc analysis was conducted. The
younger children were significantly less stable than the
adults (p < .01) and the older children (p < .05). A significant
difference in stability was found between the age
groups for slope (F(2, 90) = 3.89, p < .05, η² = .086).
Bonferroni post hoc analysis revealed that the adults
exhibited a significantly less stable slope value than the
younger children (p < .05).
Discussion
In the present study, we investigated the intra-individual
stability of confidence judgments and the realism of those
judgments over time. As a background to this issue, we first
discuss whether the age groups differed in confidence and
the confidence realism measures. However, since adults
were the reference group, the differences found between
the child groups for the confidence, bias, and calibration
measure are only descriptive in nature. Overall, the results
indicated a difference in the age groups in regard to the
level of confidence. The adults were significantly less
confident than the children. The youngest children
appeared to be the most confident age group. We found
slight underconfidence in the adult group and fairly marked
overconfidence in the two groups of children. The adults
were also better calibrated than the older children, and they
in turn were better calibrated than the younger children.
These results are in line with previous research (Allwood
et al. 2008; Knutsson et al. 2011). The results for slope, due
predominantly to small correct–incorrect differences,
indicated that all groups differentiated poorly between
correct and incorrect answers at all time points.
Our first hypothesis that the participants’ confidence
level would be stable across time was not confirmed
because the confidence level significantly increased for
some individuals but decreased for other individuals,
indicating that confidence may not be as stable as earlier
studies have reported. However, it is important to note that
the context investigated in the present study differs from
the contexts that have been investigated in previous
research. The present study may be the first to investigate
the stability of confidence and measures of confidence
realism when individuals answer questions on different
occasions about different details of the same event. The
fact that the participants answered different questions at the
three occasions may have led to the differences in confi-
dence level found in this study. Further research using the
present methodology should investigate whether this find-
ing generalizes to other contexts for confidence judgments.
If valid, the finding that confidence shows instability
over time in the context analyzed in the present study is of
interest and would probably not have been detected if the
analysis had been carried out only on a group level as in
previous research; the individuals whose confidence level
increased with time would have cancelled out those whose
confidence level decreased with time. In general, stability
is a question of degree and our results show that confidence
may be somewhat less stable over time than reported in
previous research. A task for future research is to investigate
the extent to which the confidence stability found by
Jonsson and Allwood (2003) depends on the differences in
the memory tasks used (word knowledge and logical/spatial
ability) and/or the different types of statistical analyses
employed. Yet another task for future research may be to
investigate inter-individual differences in the stability of
confidence judgments in the context of different personality
traits. By incorporating measures of personality traits
into the multilevel modelling, it might be possible to
explain why some persons’ confidence increases while
others’ decreases over time.
Speculatively, the reason why we did not find monotonic
stability for confidence could be that confidence is partly
based on experience-based cues and that these
cues, in this paradigm where different
questions were asked on each occasion, might have led to
differences in the rank order correlation.
In line with our second hypothesis predicting that bias
and calibration would exhibit strict stability over time, our
results did show strict intra-individual stability for the
absolute measures of confidence realism. On a general
level, these results are in line with previous studies
Table 4 Mean and standard deviation (SD) of the range values for
each measure for the adults, older children (10–11 years), and
younger children (8–9 years)

Measure                 Adults          Older children    Younger children
                        M      SD       M      SD         M      SD
Proportion correct      .163   .088     .164   .082       .198   .137
Confidence              .093   .049     .075   .042       .096   .084
Bias                    .167   .085     .189   .103       .214   .145
Calibration             .050   .039     .065   .044       .104   .076
Slope                   .116   .087     .080   .046       .075   .047

The range value was calculated for each individual by taking his or
her maximum score of the three measurement points for a measure
and subtracting the individual’s minimum score on the same measure
reviewed above (e.g., Mengelkamp and Bannert 2010).
However, our finding of strict stability for the absolute
confidence measures adds to the previous research in this
context since earlier studies only investigated monotonic
stability, not strict stability. Assuming that our findings of
strict stability hold up in future research, strict stability for
absolute confidence realism may be somewhat more
directly useful for justice professionals than a finding of
monotonic stability would have been. In addition, our study
differed from the Mengelkamp and Bannert (2010) study in
that we measured the groups at three different time points,
meaning that we were able to explore whether any linear or
quadratic trends exist in the metacognitive measures (no
such trends were found). The stability of the absolute
measures is in line with our suggestion that information-
based cues might play a larger role in these measures than
experience-based cues (Koriat et al. 2008). As described
earlier, information-based cues might involve conceptions
(or ‘‘theories’’) about one’s own competences relevant for
the task and cues concerning such conceptions might be
more stable over time compared with experience-based
cues such as fluency that have been suggested to show
variability over time (Leonesio and Nelson 1990).
In the Introduction, we suggested that the stability of
bias and calibration may be enhanced by the fact that they
may inherit the degree of stability evident in the confidence
judgments. The reason for this expectation was that con-
fidence is a component of the bias and calibration mea-
sures, and that confidence on the basis of previous research
was expected to show stability. However, in contrast to
earlier research that utilized less advanced statistical
methods, we found fairly poor stability in confidence
judgments, which suggests that quasi-inheritance of sta-
bility from the confidence judgments may not be a good
explanation for the stability found in the absolute measures
in the present study. Similarly, the non-stability of pro-
portion correct answers shows that inherited variance from
proportion correct answers also does not help explain the
stability of bias and calibration.
Finally, in the context of our second hypothesis, we note
that the strict stability of realism identified among the
children for bias was not associated with good realism
considering that, on average, both the older and younger
children were overconfident. Thus, over time, the bias
shown by the three groups did not lessen or increase.
The third hypothesis was that the relative measure of
realism of confidence, that is, slope, would not show sta-
bility. In line with most previous research (e.g., Mengelk-
amp and Bannert 2010), we found some support for this
hypothesis since the ICC was low for the slope measures.
The participants’ performance with respect to slope is not
under the participants’ direct control to the same extent as
the absolute confidence accuracy measures, and this may
have promoted instability. To elaborate, improving one’s
general bias might simply involve adjusting one’s general
level of confidence (e.g., simply decreasing one’s confi-
dence if overconfidence is at hand), whereas improving
one’s slope may involve identifying when one’s answer is
correct or incorrect (for example, by making better use of
information to determine whether the answer is correct) by
using different types of cues considered relevant to the
task, and then lowering or increasing one’s confidence as
appropriate after considering these cues (Yates 1990). This
variability in possible cues may contribute to a lesser
degree of stability in slope compared to the measures of
absolute confidence realism. In addition, the described lack
of stability in slope corresponds to experience-based cues
playing a more important role for relative measures,
including slope, compared to more constant information-
based cues (Koriat et al. 2008). Experience-based cues,
such as fluency, have been suggested to show variability
over time (Leonesio and Nelson 1990), which may have
contributed to the instability of slope. Therefore, the reason
why we did not find monotonic stability for slope could be
that the slope is more driven by experience-based cues.
However, this is only a speculation and further research is
needed on this issue.
Our fourth hypothesis asserted that the adults would
show a higher degree of stability than the two groups of
children for confidence and realism. This hypothesis was
confirmed for calibration. The possibility that adults had an
ability to base their confidence judgments more on theory-
based cues (i.e., information-based cues) than children
(Koriat et al. 2008) may help explain this result. However,
no significant difference was found in the degree of sta-
bility between the adults and older children for bias, and
surprisingly, younger children were found to be more
stable than adults in regard to slope. We do not have an
explanation for the last result, but interestingly the results
for slope showed very poor performance for all groups,
especially the children. The greater instability for adults
may be explained by a floor effect for the children.
Finally, when comparing the results for the Goodman–
Kruskal gamma correlations for our adult subsample across
the three time points to the Pearson correlations reported by
Jonsson and Allwood (2003) for their 18-year-old partici-
pants, we mostly found differences. First, we found low or
average correlations between the levels of confidence at the
three measurement points, whereas Jonsson and Allwood
(2003) found high correlations. Second, we found low
correlations for bias and even some negative correlations
for calibration between the three measurements, whereas
Jonsson and Allwood (2003) found medium or somewhat
high correlations for both these measures. However, a
further interesting difference can be noted. Though the
proportion of correct answers did not correlate between any
of the time points in our study and even correlated (non-
significantly) negatively, these correlations were fairly high
in the results of the Jonsson and Allwood (2003) study.
This difference in results might help to explain the differ-
ences in the other correlations between the studies, and
speculatively, the difference itself could possibly be
explained by the fact that the type of knowledge measured
by Jonsson and Allwood (2003) (word knowledge and
logical/spatial ability) might be more available in memory
than the fairly neutral events in the film clip used in the
present study.
Future research should investigate more thoroughly the
extent to which differences in the type of retrieved mem-
ories contribute differently to stability over time for con-
fidence and various metacognitive measures. For example,
Perfect (2002) argued that important differences exist
between episodic and semantic memory with respect to
confidence realism. Such differences might render different
results for metacognitive stability for the different types of
memory material.
As described above, different questions were used at the
three time points. This was done in order to avoid the
testing effect, the effect that active repetition tends to
improve the correctness of later recall (Roediger and
Karpicke 2006a, b), and to avoid the influence of the
reiteration effect (Hertwig et al. 1997), an increase in a
person’s confidence in an assertion when they repeat it
(Odinot et al. 2009). Instead, we were interested in inves-
tigating how stable metamemory realism is across time
when the testing and reiteration effects are controlled.
However, it should be noted that the repetition effect may
not have been completely avoided since memories from the
same witnessed event were accessed on all three occasions.
Still, the results from this study have implications with
respect to the effects of the point in time after witnessing a
crime that the interview is administered (e.g., in close
proximity to when a crime was witnessed or some time
later) on metamemory realism.
Interestingly, the questions showed negative inter-item
covariance when it came to proportion correct. This
usually occurs when items have been incorrectly coded,
which was not the case in this study. However, it could be
that the items’ difficulty level was not consistent, violating
the assumption of equal error variance that is required for
reliability testing (Sijtsma 2009). What is interesting is the
high reliability values found for confidence, supporting the
notion of a confidence trait suggested by researchers such
as Kleitman and Stankov (2007).
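To illustrate why negative inter-item covariance undermines reliability estimates, the sketch below computes Cronbach's alpha for two small data sets. It assumes that the reliability coefficient in question is Cronbach's alpha (the statistic discussed by Sijtsma 2009). The function name `cronbach_alpha` and both data sets are invented for illustration and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a participants-by-items score table.

    scores: list of rows, one row per participant, one column per item.
    """
    k = len(scores[0])
    # Sample variance of each item (column) across participants.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each participant's total score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Invented data: items that tend to be answered correctly together.
consistent_items = [
    [1, 1, 1], [1, 1, 1], [0, 0, 0],
    [0, 0, 0], [1, 1, 0], [0, 0, 1],
]

# Invented data: a correct answer on one item tends to go with
# incorrect answers on the others (negative inter-item covariance).
opposed_items = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 1, 0], [0, 1, 1], [1, 0, 0],
]

alpha_consistent = cronbach_alpha(consistent_items)
alpha_opposed = cronbach_alpha(opposed_items)
```

With the first data set alpha is about 0.79. With the second, the variance of the total scores is smaller than the sum of the item variances, so alpha becomes negative and cannot be interpreted as a reliability.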
Some limitations of the present study should be noted.
First, similar to previous studies on confidence stability,
our memory task was a recognition task, and thus, the
stability found for some of the measures in the present
study may not generalize to an open free recall task or
to a memory task involving the recall of answers to
directed open questions. Moreover, the memory ques-
tions were new at each of the three occasions; thus, our
results may not generalize to situations where the same
questions are repeated on many occasions. Therefore, the
results from this study need to be replicated, preferably
with a larger sample and also with other types of
memory tasks and situations. Second, the present study
investigated stability over a time period of only 3 weeks.
In order to increase the relevance of our results for
forensic situations in which witnesses might give their
testimony, for example, half a year, or a year, after they
witnessed an event, it would be interesting to study
stability in confidence and metamemory realism over
longer time periods. Finally, it is not clear why cor-
rectness showed poor stability over the three measure-
ments in our study. To some extent, the use of different
knowledge questions at each of the three measurement
points (although these were controlled for level of dif-
ficulty) may have contributed to the lack of stability in
correctness. However, further analyses regarding the
effect of the order of the sets of questions did not reveal
any significant differences for the proportion of correct
answers, confidence level, or any of the realism measures
at any of the different time points.
Acknowledgments This study was partially funded by the Crime
Victim Compensation and Support Authority and partially by the
Swedish Research Council (VR) with grants to the second author.
References
Allwood CM (2010) Eyewitness confidence. In: Granhag PA (ed)
Forensic psychology in context. Willan Publishing, Devon,
pp 281–303
Allwood CM, Ask K, Granhag PA (2005) The cognitive interview:
effects on the realism in witnesses’ confidence in their free
recall. Psychol Crime Law 11:183–198. doi:10.1080/10683160
512331329943
Allwood CM, Granhag PA, Jonsson AC (2006) Child witnesses’
metamemory realism. Scand J Psychol 47:461–470. doi:
10.1111/j.1467-9450.2006.00530.x
Allwood CM, Innes-Ker AH, Holmgren J, Fredin G (2008) Children’s
and adults’ realism in their event-recall confidence in responses
to free recall and focused questions. Psychol Crime Law
14:529–547. doi:10.1080/10683160801961231
Asendorpf JB (1989) Individual, differential, and aggregate stability
of social competence. In: Schneider BH, Attili G, Nadel J,
Weissberg R (eds) Social competence in developmental per-
spective. Kluwer Academic Publishers, Dordrecht, pp 71–86
Benjamin AS, Diaz M (2008) Measurement of relative metamne-
monic accuracy. In: Dunlosky J, Bjork RA (eds) Handbook of
metamemory and memory. Psychology Press, New York,
pp 73–94
Bornstein BH, Zickafoose DJ (1999) “I know I know it, I know I saw
it”: the stability of the confidence–accuracy relationship across
domains. J Exp Psychol Appl 5:76–88. doi:10.1037/1076-898X.
5.1.76
Boyce M, Beaudry JL, Lindsay RCL (2007) Belief of eyewitness
identification evidence. In: Lindsay RC, Ross DF, Read JD,
Toglia MP (eds) Handbook of eyewitness psychology, vol 2,
Memory for people. Lawrence Erlbaum Associates, Mahwah,
pp 501–525
Brewer N (2006) Uses and abuses of eyewitness identification
confidence. Leg Criminol Psychol 11:3–23. doi:10.1348/135532
505X79672
Cutler BL, Penrod SD, Stuve TE (1988) Juror decision making in
eyewitness identification cases. Law Hum Behav 12:41–55. doi:
10.1007/BF01064273
Duncan TE, Duncan SC, Strycker LA (2006) An introduction to latent
variable growth curve modelling: concepts, issues, and applica-
tions, 2nd edn. Lawrence Erlbaum Associates, Mahwah
Gabbert F, Memon A, Allan K (2003) Memory conformity: can
eyewitnesses influence each other’s memories for an event? Appl
Cogn Psychol 17:533–543. doi:10.1002/acp.885
Hertwig R, Gigerenzer G, Hoffrage U (1997) The reiteration effect in
hindsight bias. Psychol Rev 104:194–202. doi:10.1037/0033-
295X.104.1.194
Howie P, Roebers CM (2007) Developmental progression in the
confidence-accuracy relationship in event recall: insights pro-
vided by a calibration perspective. Appl Cogn Psychol
21:871–893. doi:10.1002/acp.1302
Jonsson A-C, Allwood CM (2003) Stability and variability in the
realism of confidence judgments over time, content domain, and
gender. Pers Individ Differ 34:559–574. doi:10.1016/S0191-
8869(02)00028-4
Juslin P, Olsson H, Winman A (1996) Calibration and diagnosticity of
confidence in eyewitness identification: comments on what
cannot be inferred from a low confidence-accuracy correlation.
J Exp Psychol Learn Mem Cogn 22:1304–1316. doi:
10.1037/0278-7393.22.5.1304
Kelemen WL, Frost PJ, Weaver CA III (2000) Individual differences
in metacognition: evidence against a general metacognitive
ability. Mem Cogn 28:92–107. doi:10.3758/BF03211579
Kelley CM, Lindsay DS (1993) Remembering mistaken for knowing:
ease of retrieval as a basis for confidence in answers to general
knowledge questions. J Mem Lang 32:1–24. doi:10.1006/
jmla.1993.1001
Kleitman S (2008) Metacognition in the rationality debate. Self-
confidence and its calibration. VDM Verlag Dr. Mueller,
Germany
Kleitman S, Stankov L (2007) Self-confidence and metacognitive
processes. Learn Individ Differ 17:161–173. doi:10.1016/j.lindif.
2007.03.004
Knutsson J, Allwood CM, Johansson M (2011) Child and adult
witnesses: the effect of repetition and invitation-probes on free
recall and metamemory realism. Metacogn Learn 6:213–228.
doi:10.1007/s11409-011-9071-y
Koriat A (1993) How do we know that we know? The accessibility
model of the feeling of knowing. Psychol Rev 100:609–639. doi:
10.1037/0033-295X.100.4.609
Koriat A (1994) Memory’s knowledge of its own knowledge: The
accessibility account of the feeling of knowing. In: Metcalfe J,
Shimamura AP (eds) Metacognition: knowing about knowing.
MIT Press, Cambridge MA, pp 115–135
Koriat A, Nussinson R, Bless H, Shaked N (2008) Information-based
and experience-based metacognitive judgments: Evidence from
subjective confidence. In: Dunlosky J, Bjork RA (eds) A
handbook of memory and metamemory. Lawrence Erlbaum
Associates, Mahwah, pp 117–136
Kuhn D, Dean D (2004) A bridge between cognitive psychology and
educational practice. Theory Pract 43:268–273
Kwok O, Underhill AT, Berry JW, Luo W, Elliott TR, Yoon M
(2008) Analyzing longitudinal data with multilevel models: an
example with individuals living with lower extremity intra-
articular fractures. Rehabil Psychol 53:370–386. doi:10.1037/
a0012765
Lane SM, Mather M, Villa D, Morita SK (2001) How events are
reviewed matters: effects of varied focus on eyewitness
suggestibility. Mem Cogn 29:940–947. doi:10.3758/BF03195
756
Leippe MR, Eisenstadt D (2007) Eyewitness confidence and the
confidence accuracy relationship in memory for people. In:
Lindsay RC, Ross DF, Read JD, Toglia MP (eds) Handbook of
eyewitness psychology, vol 2, Memory for people. Lawrence
Erlbaum Associates, Mahwah, pp 377–425
Leonesio RJ, Nelson TO (1990) Do different metamemory judgments
tap the same underlying aspects of memory? J Exp Psychol Learn
Mem Cogn 16:464–470. doi:10.1037/0278-7393.16.3.464
Lichtenstein S, Fischhoff B (1977) Do those who know more also
know more about how much they know? Organ Behav Hum
Perform 20:159–183. doi:10.1016/0030-5073(77)90001-0
Lichtenstein S, Fischhoff B, Phillips LD (1982) Calibration of
probabilities: the state of the art of 1980. In: Kahneman D,
Slovic P, Tversky A (eds) Judgment under uncertainty: heuristics
and biases. Cambridge University Press, Cambridge, pp 306–334
Lindsay RCL, Wells GL, Rumpel CM (1981) Can people detect
eyewitness identification accuracy within and across situations?
J Appl Psychol 66:79–89. doi:10.1037/0021-9010.66.1.79
Marsh EJ, Tversky B, Hutson M (2005) How eyewitnesses talk about
events: implications for memory. Appl Cogn Psychol
19:531–544. doi:10.1002/acp.1095
Mengelkamp C, Bannert M (2010) Accuracy of confidence judg-
ments: stability and generality in the learning process and
predictive validity for learning outcome. Mem Cogn
38:441–451. doi:10.3758/MC.38.4.441
Nelson TO (1984) A comparison of current measures of the accuracy
of feeling-of-knowing predictions. Psychol Bull 95:109–133.
doi:10.1037/0033-2909.95.1.109
Nelson TO (1996) Gamma is a measure of the accuracy of predicting
performance on one item relative to another item, not of the
absolute performance on an individual item. Appl Cogn Psychol
10:257–260. doi:10.1002/(SICI)1099-0720(199606)10:3
Odinot G, Wolters G, Lavender T (2009) Repeated partial eyewitness
questioning causes confidence inflation but not retrieval-induced
forgetting. Appl Cogn Psychol 23:90–97. doi:10.1002/acp.1443
Paterson HM, Kemp RI (2006) Co-witness talk: a survey of
eyewitnesses’ discussion. Psychol Crime Law 12:181–192. doi:
10.1080/10683160512331316334
Perfect TJ (2002) When does eyewitness confidence predict perfor-
mance? In: Perfect TJ, Schwartz B (eds) Applied metacognition.
Cambridge University Press, Cambridge, pp 95–120
Quene H, van den Bergh H (2004) On multi-level modeling of data
from repeated measures designs: a tutorial. Speech Commun
43:103–121. doi:10.1016/j.specom.2004.02.004
Raudenbush SW (2001) Comparing personal trajectories and drawing
causal inferences from longitudinal data. Annu Rev Psychol
52:501–525. doi:10.1146/annurev.psych.52.1.501
Roediger HL III, Karpicke JD (2006a) Test-enhanced learning.
Taking memory tests improves long-term retention. Psychol Sci
17:249–256. doi:10.1111/j.1467-9280.2006.01693.x
Roediger HL III, Karpicke JD (2006b) The power of testing memory.
Basic research and implications for educational practice. Perspect
Psychol Sci 1:181–210. doi:10.1111/j.1745-6916.2006.00012.x
Schlottmann A, Anderson NH (1994) Children’s judgments of
expected value. Dev Psychol 30:56–66. doi:10.1037/0012-
1649.30.1.56
Schneider W (2008) The development of metacognitive knowledge in
children and adolescents: major trends and implications for
education. Mind Brain Educ 2:114–121
Schneider W, Lockl K (2002) The development of metacognitive
knowledge in children and adolescents. In: Perfect T, Schwartz
B (eds) Applied metacognition. Cambridge University Press,
Cambridge, pp 224–257
Schraw G (2009) Measuring metacognitive judgments. In: Hacker DJ,
Dunlosky J, Graesser AC (eds) Handbook of metacognition in
education. Routledge, New York, pp 415–429
Sijtsma K (2009) On the use, the misuse, and the very limited
usefulness of Cronbach’s alpha. Psychometrika 74:107–120. doi:
10.1007/s11336-008-9101-0
Sporer SL, Penrod S, Read D, Cutler B (1995) Choosing, confidence,
and accuracy: a metaanalysis of the confidence-accuracy relation
in eyewitness identification studies. Psychol Bull 118:315–327.
doi:10.1037/0033-2909.118.3.315
Stankov L, Crawford JD (1996) Confidence judgments in studies of
individual differences. Pers Individ Differ 21:971–986. doi:
10.1016/S0191-8869(96)00130-4
Thompson WB, Mason SE (1996) Instability of individual differences
in the association between confidence judgments and memory
performance. Mem Cogn 24:226–234. doi:10.3758/BF03200883
Tisak J, Meredith W (1990) Descriptive and associative development
models. In: Von Eye A (ed) Statistical methods in longitudinal
research, vol 2. Academic Press, Boston, pp 387–406
Wells GL, Bradfield AL (1999) Distortions in eyewitnesses’ recol-
lections: can the postidentification feedback effect be moder-
ated? Psychol Sci 10:138–144. doi:10.1111/1467-9280.00121
Wells GL, Lindsay RCL, Ferguson TJ (1979) Accuracy, confidence,
and juror perceptions in eyewitness identification. J Appl
Psychol 64:440–448. doi:10.1037/0021-9010.64.4.440
Wells GL, Olson EA, Charman SD (2002) The confidence of
eyewitnesses in their identifications from lineups. Curr Dir
Psychol Sci 11:151–154. doi:10.1111/1467-8721.00189
Wells GL, Memon A, Penrod SD (2006) Eyewitness evidence:
improving its probative value. Psychol Sci Publ Int 7:45–74
Yates JF (1990) Judgment and decision making. Prentice Hall,
Englewood Cliffs
Yates JF (1994) Subjective probability accuracy analysis. In: Wright
G, Ayton P (eds) Subjective probability. Wiley, New York,
pp 381–410