Correlation and consistency in reasoning test
scores over time
Dr Steve Strand
Paper presented at the Annual Conference of the British Educational Research Association, University of Exeter, England, 12-14 September 2002
Address for Correspondence
Dr Steve Strand, Senior Assessment Consultant, nferNelson
The Chiswick Centre, 414 Chiswick High Road
LONDON W4 5TF
Tel: (020) 8996 8414
e-mail: [email protected]
Keywords: Reasoning ability, correlation, stability, change, consistency, school effects, value added
ABSTRACT
UK schools have a long history of using reasoning tests, most frequently of Verbal
Reasoning (VR), Non Verbal Reasoning (NVR), and to a lesser extent Quantitative
Reasoning (QR). Results are used to identify students’ learning needs, for grouping students,
for identifying underachievement, and for providing indicators of future academic
performance (Fernandes & Strand, 2000). Despite this widespread use there is little data on
the long term consistency of VR, QR and NVR as discrete abilities. This study compares the
performance of over 10,000 pupils who completed the Cognitive Abilities Test Second
Edition (CAT2E) in year 6 (age 10+) and year 9 (age 13+) and GCSE public examinations
in year 11 (age 15+). The results reveal very high correlations in scores over time, ranging
from .87 for VR to .76 for NVR, but also show around one-quarter of pupils on the VR test
and one-third of pupils on the QR and NVR tests changed their scores by eight or more
standard score points. Schools accounted for only a small part of the total variation in
reasoning scores, although they accounted for a much greater proportion of the variation in
measures of attainment such as GCSE. School effects on pupils’ progress in the reasoning tests
between age 10 and age 13 were relatively modest. Some practical and policy implications
for schools are discussed.
Correlation and consistency in reasoning ….. Page 2
INTRODUCTION
UK schools have a long history of using tests to assess students’ reasoning ability. Reasoning
tests attempt to assess the perceiving of relationships among abstract elements and symbols.
They differ from attainment tests in that the test material is not intended to be a sampling of
what is taught to any particular age group in school; rather, it attempts to minimise the effect
of specific curricular experience. The tests use general, not specialised, knowledge that
individuals in a particular age group could have acquired from a broad variety of experiences
in or out of school. The basic test elements are kept relatively simple, clear and familiar with
the intended emphasis on discovery of relationships and flexibility of thinking. The most
frequently used tests are of Verbal Reasoning (VR), Non Verbal Reasoning (NVR), and to a
lesser extent Quantitative Reasoning (QR). Verbal tests use symbols representing words, QR
tests use symbols representing numbers or quantities, while NVR tests use symbols
representing spatial, geometric or figural patterns.
Prior to the mid-1970s, reasoning tests were a formal part of the process of selection to
secondary school at age 11 in most areas of the UK, although this practice has largely
disappeared with comprehensive education. However, reasoning tests remain in widespread
use for the diagnosis of learning needs, for grouping students, for identifying
underachievement, and for providing indicators of future academic performance (Strand,
2000; Fernandes & Strand, 2000).
Despite their widespread use, there is relatively little empirical data on the consistency over
time of group tests of VR, QR and NVR as discrete reasoning abilities. Much has been
written about the stability or otherwise of full scale IQ scores. Early studies are well
reviewed by Pinneau (1961), Anastasi (1976) and Vernon (1979). As would be expected,
test-retest correlations are higher the shorter the interval between tests, and tend to be higher
for older children. Vernon concludes that typical correlations from 6 to 10 years, or from 10
to 17 years, are approximately .70 (Vernon, 1979, p75). More recent studies (e.g., Hindley &
Owen, 1978; Moffitt, Caspi, Harkness & Silva, 1993) have produced even higher
correlations. For example, Moffitt et al. tested nearly 1,000 pupils on the Wechsler
Intelligence Scale for Children - Revised (WISC-R) at age 7, 9, 11 and 13, and report
correlation coefficients ranging from .74 between age 7 and 13 up to .85 between age 9 and
11 and between age 11 and 13.
Full scale IQ tests, such as the Wechsler and Stanford-Binet, cover a wide range of item
types, sometimes require oral responses from the testee, necessitate the manipulation of
material and are administered on a one-to-one rather than a group basis. Similar levels of
consistency are not assured for group administered reasoning tests. Given the widespread use
of group tests within educational settings, data on their long term consistency is needed.
Additionally, factorial studies in which a variety of ability factors are distinguished (Vernon,
1961), and the distinction between “fluid” and “crystallised” intelligence proposed by Cattell
(1963), suggest it is particularly desirable to have information separately on VR, QR and
NVR reasoning tests. Some studies (Hopkins & Bracht, 1975) have indicated higher
consistency for VR rather than NVR test scores. This may have important practical as well as
theoretical implications.
The current study looks in particular at the consistency of scores on the Cognitive Abilities
Test Second Edition (CAT2E) between the ages of 10 and 13 to shed light on the following
questions:
1. What is the correlation between pupils’ reasoning test scores at age 10 and age 13? Are
there differences between verbal, quantitative and non-verbal reasoning tests in their
consistency over time?
2. What is the extent of change in individual pupils’ scores over time? How should this
variation be interpreted?
3. What is the influence of the school on pupils’ performance in reasoning tests? Can the
results be used to assess the ‘value-added’ by the school?
4. What are the implications for the interpretation and use of reasoning test scores?
5. What are the implications for educational practice and policy?
METHOD
The dataset
Data was collected over a three year period from a large Local Education Authority (LEA) in
the South East of England. The CAT2E is used with all pupils in the LEA on two occasions:
Level C is completed in October of year 6 at primary school (average pupil age 10:6) and
Level F in October of year 9 at secondary school (average pupil age 13:6). Both sets of
CAT2E scores, along with subsequent year 11 GCSE examination results, were collected on
three separate cohorts of pupils who completed their GCSE examinations in summer 1998,
1999 and 2000 respectively. A total of 10,644 pupils were included in the study. For some
analyses pupils attending Pupil Referral Units or special schools were excluded, as were
pupils with a missing score on any of the three CAT batteries at each age, leaving a core
sample of 10,621 pupils attending 25 mainstream secondary schools. The average year group
within schools comprised 144 pupils, with two-thirds of year groups in the range 101 to 187
pupils.
The Cognitive Abilities Test Second Edition (CAT2E)
The Cognitive Abilities Test Second Edition (CAT2E) (Thorndike, Hagen & France, 1986)
provides an assessment of reasoning skills with separate Verbal Reasoning (VR),
Quantitative Reasoning (QR) and Non Verbal Reasoning (NVR) batteries. The pupil’s mean
score over the three batteries (mean CAT score) is also calculated. The test is divided into six
levels A-F, and is UK standardised across the age range 7:6 to 15:9. The CAT is the most
widely used test of reasoning ability in the UK, with 800,000 students assessed during the
academic year 1999/2000.
The Verbal Reasoning (VR) battery consists of four tests:
1. Vocabulary – choosing a synonym from a list of five possibilities (e.g., wish : agree,
bone, over, want, waste)
2. Sentence Completion – selecting one word from a list of five (e.g., John likes to ____
a football match: eat, help, watch, read, talk)
3. Verbal Classification – given three or four words belonging to one class, select which
further word from a list of five belongs to the same class (e.g. eye, ear, mouth : nose,
smell, head, boy, speak)
4. Verbal Analogies – given one pair of words, complete a second pair from five
possibilities (e.g., big->large; little->? : boy, small, late, lively, more)
The Quantitative Reasoning (QR) battery consists of three tests:
1. Quantitative Relations: given two quantities, decide whether one is greater than, equal to
or less than the other (e.g., ¼ vs. ½)
2. Number Series: select one from five possible choices to complete the series (e.g.,
2, 4, 6, 8, ?)
3. Equation Building: utilise the given elements to create a true equation (e.g., 2 2 3 +
x : 6, 8, 9, 10, 11)
The Non Verbal Reasoning (NVR) battery consists of three tests:
1. Figure Classification: given three or four shapes belonging to one class, select which
further shape from a list of five belongs to the same class;
2. Figure Analogies: given one pair of shapes, complete a second pair from five
possibilities
3. Figure Synthesis: given different shaped ‘pieces’, decide whether target figures can
or cannot be constructed from them.
The reliability of the CAT2E
Any interpretation of long term consistency in CAT2E scores has to be considered in the
context of the reliability of the test. Variation in test scores over time will be due to a variety
of factors. At least some of these factors will be transient, including variation in pupils’
motivation or affective state at the time of taking the test/s, possible variations in test
administration, errors in marking, etc. (note 1). For this reason, published tests should provide clear
measures of their reliability. Such reliability data for the CAT2E comes from a number of
sources. Internal consistency estimates for the CAT2E are high, averaging 0.94, 0.91 and
0.93 for VR, QR and NVR batteries respectively (Thorndike et al., 1986). Six month test-retest
correlations are also available from the US version (CogAT, Form 3) for each battery
at each test level (reported by Sax, 1984). Correlations for VR range across levels from .85 to
.93, for QR from .78 to .88, and for NVR from .81 to .89. In all instances the lower values
are for the youngest (age 8) pupils. These figures indicate that the CAT2E is highly reliable,
and effectively set a ceiling on the longer term correlations we might expect from the three
year test-retest data reported in this study.
RESULTS
Consistency over time
Table 1 presents the correlations between the different batteries at each age and the
correlations between batteries over time. The diagonal figures in bold show the three year
test-retest correlations. These reveal correlations of .87 for verbal reasoning, .79 for
quantitative reasoning, .76 for non-verbal reasoning and .89 for mean CAT score. Given the
three year time period between the tests, which includes the period of transfer between
primary and secondary school, these correlations are remarkably high. It is notable that the
correlation for VR is substantially higher than for either QR or NVR. These correlations
indicate substantial consistency in pupils’ reasoning scores over the three year period.
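As a minimal sketch of the statistic reported in Table 1, the Pearson correlation between two sets of scores can be computed as below. The five pairs of standard scores are hypothetical values chosen for illustration, not data from the study, and the function name is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two pupils")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical standard scores for five pupils (NOT data from the study).
age10 = [90, 95, 100, 105, 110]
age13 = [98, 100, 101, 100, 101]
print(round(pearson_r(age10, age13), 2))
```

The same function applied to one battery at both ages gives a test-retest correlation; applied to two batteries at one age it gives an intercorrelation.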
<------------------------------------------>
INSERT TABLE 1 ABOUT HERE
<------------------------------------------>
The intercorrelations between the three CAT batteries are shown off the diagonal in Table 1,
with the upper right cells showing the intercorrelations at age 13 and lower left cells showing
the intercorrelations at age 10. These range from .64 between VR and NVR at age 13, to .72
between VR and QR at age 10. While these correlations are highly significant, they are
sufficiently low to suggest real differences in pupils’ performance in the verbal, quantitative
and non-verbal tests.
There was no significant difference in the test-retest correlations of boys and girls, which
were identical for VR and differed only slightly for QR (.80 vs. .78) and NVR (.75 vs. .77)
for boys and girls respectively.
Individual variability
A high correlation does not mean that pupils’ scores are constant and unchanging; indeed,
high correlation coefficients can conceal important change within individuals over time.
Change scores, the difference between a pupil’s standardised score on the retest and their
initial standardised score, were calculated for each battery. Table 2 indicates the proportion
of pupils with a change score in specified ranges.
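The change-score calculation described above can be sketched as follows. The six pupils and the limits of 7 and 10 points are illustrative assumptions; the actual ranges used are those of Table 2.

```python
def change_scores(initial, retest):
    """Retest standardised score minus initial standardised score, per pupil."""
    return [r - i for i, r in zip(initial, retest)]

def proportion_within(changes, limit):
    """Share of pupils whose absolute change is no larger than `limit` points."""
    return sum(1 for c in changes if abs(c) <= limit) / len(changes)

# Hypothetical scores for six pupils (NOT data from the study).
initial = [88, 95, 100, 104, 112, 120]
retest = [97, 93, 101, 112, 110, 109]

changes = change_scores(initial, retest)
print(changes)                          # [9, -2, 1, 8, -2, -11]
print(proportion_within(changes, 7))    # 0.5
print(proportion_within(changes, 10))
```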
<------------------------------------------>
INSERT TABLE 2 ABOUT HERE
<------------------------------------------>
The extent of these changes needs to be considered in relation to the standard error of
measurement (SEM) of the test. The SEM for each CAT battery is estimated to be four
standard score points (Thorndike et al., 1986, p49). Given this figure, the 90% confidence
band for any pupil’s test score is ±1.645 × SEM, or plus or minus seven
standard score points. Change scores within this range represent fluctuations to be expected
from measurement error. The percentage of pupils with a retest score within 7 points of their
initial score is 72% for VR, 66% for QR and 63% for NVR (note 2). While this indicates
substantial consistency, it also shows a significant change in score for approximately one-quarter
of pupils on the VR test and one-third of pupils on the QR and NVR tests.
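The arithmetic behind the band is small enough to show directly. This sketch uses the SEM of four points and the multiplier 1.645 quoted in the text; rounding the half-width to a whole score point, and the function names, are assumptions made for illustration.

```python
Z_90 = 1.645  # two-sided 90% normal deviate, as quoted in the text
SEM = 4.0     # standard error of measurement quoted for each CAT battery

def band_half_width(sem=SEM, z=Z_90):
    """Half-width of the confidence band in standard score points."""
    return z * sem

def exceeds_band(initial, retest, sem=SEM, z=Z_90):
    """True if the change is larger than the band rounded to whole points."""
    limit = round(band_half_width(sem, z))  # 6.58 rounds to 7
    return abs(retest - initial) > limit

print(round(band_half_width(), 2))  # 6.58
print(exceeds_band(100, 107))       # False: within plus or minus 7
print(exceeds_band(100, 108))       # True: a change of eight or more points
```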
The change scores for mean CAT score are also given in Table 2. These are very low with
84% of retest scores within plus or minus 7 points of initial score, and 95% within plus or
minus 10 points. However, to some extent the change scores are reduced because the mean
score across batteries necessarily has a smaller standard deviation (SD) than the individual
batteries. Thus at age 13 the SD of mean CAT score was only 11.5, compared to 13.6, 12.8
and 12.5 for the VR, QR and NVR tests respectively.
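Why the mean score has a smaller SD follows from the variance of an average of correlated scores. The sketch below uses the three age 13 SDs given above and the VR-NVR correlation of .64 from Table 1; the two correlations involving QR are assumed values of .70, chosen from within the reported range of .64 to .72, since the text does not state them for age 13.

```python
import math

def sd_of_mean(sds, corr):
    """SD of the unweighted mean of k scores.

    `sds` lists the SD of each score; `corr` maps index pairs (i, j), i < j,
    to the correlation between score i and score j.
    """
    k = len(sds)
    total = sum(s ** 2 for s in sds)          # sum of the variances
    for (i, j), r in corr.items():
        total += 2 * r * sds[i] * sds[j]      # plus twice each covariance
    return math.sqrt(total) / k

sds = [13.6, 12.8, 12.5]  # VR, QR, NVR at age 13 (from the text)
corr = {
    (0, 1): 0.70,  # VR-QR: assumed value
    (0, 2): 0.64,  # VR-NVR at age 13 (from the text)
    (1, 2): 0.70,  # QR-NVR: assumed value
}
print(round(sd_of_mean(sds, corr), 1))  # 11.5
```

Under these assumptions the result agrees with the SD of 11.5 reported for mean CAT score, which suggests the reduction is what would be expected from averaging alone.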
Regression to the mean
When interpreting simple change scores between tests taken on two occasions, it is important
to consider the influence of regression to the mean. Regression to the mean is a statistical
artefact reflecting the fact that, since the correlation between scores at any two ages is always
less than perfect, the second test score will tend to be closer to the population mean than the
first test score. This is illustrated in Figure 1 which shows the effect directly by regressing
second test score on initial test score for VR, QR and NVR respectively. We can see that
regression to the mean is (i) greatest where initial scores are towards the extremes, and (ii)
greatest for NVR, then QR, then VR, reflecting the ordering of their test-retest correlations
from lowest to highest. As a result of regression to the mean, change scores tend to be positive for low
initial scores and negative for high initial scores.
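The effect can be sketched with the standard linear prediction of a retest score from an initial score. This is a simplification rather than the regression fitted for Figure 1: it assumes a population mean of 100 and equal SDs at both ages, so that the slope equals the test-retest correlation.

```python
MEAN = 100.0  # assumed population mean of the standard score scale
RETEST_R = {"VR": 0.87, "QR": 0.79, "NVR": 0.76}  # correlations from the text

def expected_retest(initial, r, mean=MEAN):
    """Expected retest score, assuming equal SDs on both occasions."""
    return mean + r * (initial - mean)

for battery, r in RETEST_R.items():
    low = expected_retest(70, r)    # a very low initial score
    high = expected_retest(130, r)  # a very high initial score
    print(battery, round(low, 1), round(high, 1))
```

Under these assumptions a pupil scoring 70 on NVR would be expected to score about 77 on retest, against about 74 for VR, with the mirror-image pattern for a pupil scoring 130.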
<------------------------------------------>
INSERT FIGURE 1 ABOUT HERE
<------------------------------------------>
What are the implications of regression to the mean? First, the confidence bands for a pupil’s
score will not be symmetrical over the whole score range but will tend to be skewed at the
extremes. Table 3 presents confidence bands calculated across a range of scores for each
CAT2E battery. It can be seen that only around the mean score of 100 are the confidence
bands symmetrical at +7 and −7 standard score points respectively. As scores move away
from the mean the bands become asymmetrical so that a greater part of the confidence band
lies in the direction of the average score for the test. It can also be seen that this effect is most
pronounced for the NVR test and least pronounced for the VR test. Practitioners wishing to
evaluate whether an individual pupil’s change score is significant should use the specific
values in this table rather than applying the overall confidence band.
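One way to see how such asymmetry arises is to centre a band of plus or minus seven points on the regressed prediction rather than on the initial score. This is an illustrative assumption, not the method used to construct Table 3, and it again assumes a mean of 100 and equal SDs at both ages.

```python
def retest_band(initial, r, half_width=7.0, mean=100.0):
    """Band for the retest score, centred on the regressed prediction.

    Returns (lower, upper) expressed as offsets from the initial score.
    """
    expected = mean + r * (initial - mean)
    return (expected - half_width - initial, expected + half_width - initial)

for score in (70, 100, 130):
    lower, upper = retest_band(score, r=0.76)  # NVR correlation from the text
    print(score, round(lower, 1), round(upper, 1))
```

At an initial score of 100 the offsets are −7 and +7. At 70 almost the whole band lies above the initial score, and at 130 almost the whole band lies below it.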
<------------------------------------------>
INSERT TABLE 3 ABOUT HERE
<------------------------------------------>
Second, and more importantly, a consequence of regression to the mean is that any group of
low scorers can be expected to show a gain in score on a second occasion without any
intervention. Equally a group of high scorers can be expected to show a drop in score on a
second occasion. Practitioners must therefore be cautious in their interpretation of simple
change scores for groups of pupils. To attribute any change to an educational intervention,
such as a thinking skills or literacy training programme, will require sophisticated statistics
such as analysis of covariance with initial test score as a covariate, or comparisons with a
matched ‘control’ group who do not receive the intervention.
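A minimal sketch of the covariance adjustment is given below, using the textbook estimator for one covariate: the difference in retest means, less the pooled within-group slope times the difference in initial means. The eight pupils are synthetic, built so that the retest score equals 20 + 0.8 × initial score, plus 3 points for the intervention group; none of this is data from the study.

```python
def mean(values):
    return sum(values) / len(values)

def within_sums(pre, post):
    """Within-group sum of squares of pre and sum of cross-products with post."""
    mx, my = mean(pre), mean(post)
    sxx = sum((x - mx) ** 2 for x in pre)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pre, post))
    return sxx, sxy

def ancova_effect(pre_c, post_c, pre_t, post_t):
    """Covariate-adjusted difference in retest score (intervention minus control)."""
    sxx_c, sxy_c = within_sums(pre_c, post_c)
    sxx_t, sxy_t = within_sums(pre_t, post_t)
    slope = (sxy_c + sxy_t) / (sxx_c + sxx_t)  # pooled within-group slope
    return (mean(post_t) - mean(post_c)) - slope * (mean(pre_t) - mean(pre_c))

def raw_gain_difference(pre_c, post_c, pre_t, post_t):
    """Naive comparison of simple change scores between the two groups."""
    return (mean(post_t) - mean(pre_t)) - (mean(post_c) - mean(pre_c))

# Synthetic data: low scorers were selected for the intervention.
pre_control, post_control = [90, 100, 110, 120], [92, 100, 108, 116]
pre_treated, post_treated = [70, 80, 90, 100], [79, 87, 95, 103]

print(raw_gain_difference(pre_control, post_control, pre_treated, post_treated))
print(round(ancova_effect(pre_control, post_control, pre_treated, post_treated), 2))
```

In this constructed example the simple change scores suggest a 7 point advantage for the intervention group, while the adjusted estimate recovers the 3 points that were built in; the remaining 4 points are regression to the mean.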
School effects
Multi-level regression models were computed to examine the degree of variance at each level
within the three hierarchical levels that describe the data, i.e., pupils (level 1) grouped within
year cohorts (level 2) grouped within schools (level 3). Separate models were run for each of
the age 13 CAT2E scores and for the same pupils’ GCSE public examination results. The
first set of models includes only the intercept terms at each level to allow a variance
components analysis, as shown in Table 4. There are two key findings.
<------------------------------------------>
INSERT TABLE 4 ABOUT HERE
<------------------------------------------>
First, the variance attributable to the school level varies significantly across the three CAT
batteries. The school level variance for VR and QR (around 3.7%) is almost twice the school
level variance for NVR (around 2%). This might be expected since, in contrast to the NVR
test, both the verbal and quantitative reasoning tests require some familiarisation with basic
literacy and numeracy concepts and may therefore be influenced to a greater degree by the
school and curriculum exposure.
Second, the school level variance in reasoning test scores, even for VR and QR, is small
compared to the school level variance for measures of pupil attainment. For GCSE total
points score, the proportion of variance accounted for at the school level was almost 7%,
nearly twice the amount for VR and QR, and over three times the amount for NVR. For
individual GCSE subjects, the following proportions of variance at the school level were
found: English (6%), Science (6%), Design & Technology (8%), History (8%), Geography
(9%), English Literature (10%), Mathematics (11%), French (11%) and Art (16%). Clearly
schools account for a much smaller proportion of the variance in reasoning test score than
they do for attainment tests such as public examinations.
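The percentages quoted above are shares of the total variance in an intercept-only model. The sketch below shows only that final step; the three variance components are hypothetical numbers chosen for illustration, not the estimates in Table 4, and fitting the multi-level model itself is not shown.

```python
def variance_shares(components):
    """Percentage of total variance attributable to each level."""
    total = sum(components.values())
    return {level: 100.0 * value / total for level, value in components.items()}

# Hypothetical variance components (NOT the Table 4 estimates).
hypothetical = {"school": 6.0, "cohort": 1.0, "pupil": 155.0}

for level, share in variance_shares(hypothetical).items():
    print(level, round(share, 1))
```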
A further set of ‘value added’ analyses was completed by including initial test score in the
above models. We can then determine whether there was any significant variation between
secondary schools in pupils’ progress in the reasoning tests between age 10 and 13. There
were statistically significant associations between secondary school and pupil progress for all
three batteries, although the magnitude of the effect varied across batteries. A conservative
estimate of the school effect can be generated by comparing the average residual for the five
schools with the highest residuals (note 3) against the average for the five schools with the lowest
residuals (see Table 5). The difference in pupil progress between the most and least
‘effective’ schools in these terms was around 2.4 and 2.9 standard score points for VR and
NVR respectively. However, for both tests only five of the 25 schools could be reliably
distinguished from the others in terms of pupil progress (three schools with significantly
above average and two schools with significantly below average progress). The picture was
markedly different for QR. Here the ‘school effect’ was much greater, with a difference of
4.3 standard score points between the five most and five least ‘effective’ schools. A total of
14 schools had significant residuals, seven schools recording significantly above average and
seven significantly below average progress.
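The comparison of highest and lowest residuals can be sketched as follows. This is a simplification: it uses a single-level least-squares line rather than the multi-level models reported here, and the four schools and their scores are synthetic. The residual is taken as actual minus predicted age 13 score, so it is positive where a pupil does better than predicted.

```python
from collections import defaultdict

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def school_mean_residuals(records):
    """records: iterable of (school, age10_score, age13_score) tuples."""
    records = list(records)
    intercept, slope = fit_line([r[1] for r in records], [r[2] for r in records])
    by_school = defaultdict(list)
    for school, pre, post in records:
        by_school[school].append(post - (intercept + slope * pre))
    return {s: sum(v) / len(v) for s, v in by_school.items()}

def top_bottom_gap(school_means, n=5):
    """Mean residual of the n highest schools minus that of the n lowest."""
    ranked = sorted(school_means.values())
    if len(ranked) < 2 * n:
        raise ValueError("need at least 2 * n schools")
    return sum(ranked[-n:]) / n - sum(ranked[:n]) / n

# Synthetic pupils: age 13 score = 20 + 0.8 * age 10 score + a school effect.
effects = {"A": 2, "B": 1, "C": -1, "D": -2}
records = [(s, pre, 20 + 0.8 * pre + e)
           for s, e in effects.items() for pre in (90, 100, 110)]

means = school_mean_residuals(records)
print({s: round(m, 2) for s, m in means.items()})
print(round(top_bottom_gap(means, n=2), 2))
```

With only four synthetic schools the example compares the top two with the bottom two; with 25 schools the default of five on each side matches the comparison described above.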
<------------------------------------------>
INSERT TABLE 5 ABOUT HERE
<------------------------------------------>
Correlations between the school level residuals for each test were calculated to determine
how consistent schools were in their effects across the different CAT batteries. While there
was a significant correlation between the school residuals for VR and QR (r=.43, p<.03),
there was no significant correlation between the residuals for VR and NVR (r=.35) or QR
and NVR (r=.29). Some schools were more effective than others in promoting academic
reasoning as indexed by the verbal and quantitative tests, but these were not necessarily the
schools with significant changes in NVR score.
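With only 25 schools, a correlation needs to be fairly large to reach significance. The text does not state which test was used; the sketch below assumes the usual t test for a Pearson correlation, with n − 2 = 23 degrees of freedom and the two-tailed 5% critical value of 2.069.

```python
import math

T_CRIT_DF23 = 2.069  # two-tailed 5% critical value of Student's t, 23 df

def t_for_r(r, n):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

N_SCHOOLS = 25
for label, r in (("VR-QR", 0.43), ("VR-NVR", 0.35), ("QR-NVR", 0.29)):
    t = t_for_r(r, N_SCHOOLS)
    print(label, round(t, 2), abs(t) > T_CRIT_DF23)
```

On this assumption only the VR-QR correlation exceeds the critical value, which is consistent with the pattern of significance reported above.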
DISCUSSION
The main findings of the study can be summarised as follows:
1. The correlation between mean CAT score at 10:6 and 13:6 was .89, indicating a high
degree of consistency in pupils’ reasoning test scores over the three year period. The
correlation was particularly strong for VR at .87, but was also high for QR and NVR
at .79 and .76 respectively.
2. High levels of correlation do not mean there is no change in pupils’ scores over time.
Even for the VR test around one-quarter of pupils recorded a change of eight or more
standard score points. This rose to around one-third of pupils for the QR and NVR tests.
3. Because of regression to the mean, simple change scores should not be used to evaluate
the outcome of any educational intervention. Sophisticated statistics or the use of control
groups are required for such purposes.
4. The intercorrelations between the three CAT batteries ranged from .64 between VR and
NVR to .72 between VR and QR. While these correlations are highly significant, they are
sufficiently low to suggest real differences in pupils’ reasoning skills in the verbal,
quantitative and non-verbal domains.
5. There were significant differences between schools in pupils’ progress on the reasoning
tests between age 10 and 13. The variation between the most and least ‘effective’ schools
was around two standard score points for VR, three points for NVR and four points for
QR. Some schools were more effective than others in promoting progress in academic
reasoning as indexed by increases in verbal and quantitative tests, but these were not
necessarily the schools with significant changes in NVR score.
6. Despite the above, school effects on reasoning score performance are small in
comparison to their effects on pupils’ attainment, such as performance in GCSE public
examinations.
There are three areas that deserve further detailed discussion.
Consistency in reasoning scores over time
The results indicate high consistency over the three years age 10 to 13. The correlation for
mean CAT score (.89) is at the top of the range that is observed for correlations between the
results of individually administered IQ tests (Vernon, 1979). For example, Rees & Palmer
(1970) report a meta-analysis of five longitudinal studies using scores on the Stanford-Binet
test. They report correlations of .81 between scores at age 6 and 12, and .81 between scores
at age 12 and 17. Hindley & Owen (1978) report a correlation of .69 between Stanford-Binet
score at age 11 and an individually administered AH4 test at age 14, although they consider
this an underestimate of consistency because of the change of test between the two occasions.
When the same pupils had been tested at age 8 and age 11 on the Stanford-Binet the
correlation was .89, and when tested between age 14 and age 17 on the AH4 the correlation
was .87. Moffitt et al. (1993) report a correlation of .84 between WISC-R full-scale IQ at
age 11 and age 13. Given that group administered tests reportedly produce much lower
correlations over time than individually administered tests (Vernon, 1979) this is a very
substantial finding for the CAT.
The test-retest correlation for VR (.87) is notably higher than for either QR (.79) or NVR
(.76). Hopkins & Bracht (1975) report test-retest correlations between the group administered
Lorge-Thorndike VR and NVR tests at age 10 and age 13 of .79 for VR and .61 for NVR.
Thorndike et al. (1986) report that analysis from an LEA-wide study suggested “two to three
year correlations of the order of .80, .75 and .65 for VR, QR and NVR respectively” (p94).
The current author has also undertaken an analysis of the technical data reported in the
teacher manuals for the National Foundation for Educational Research (NFER) VR and NVR
test series (Hagues, Smith & Courtneay, 1993; Smith & Hagues, 1993). The two series
consist of separate tests for the 8-9, 10-11 and 12-14 age groups. The average 12 month test-
retest correlation for the three VR tests is .85 against .76 for the three NVR tests.
It is possible that the lower correlation for NVR reflects the comparative novelty of the NVR
material which is not part of the formal curriculum in the UK. This relative novelty can lead
to greater practice effects. For example, the analysis of the 12 month test-retest data for the
NFER VR and NVR series revealed an average increase of around one standard score point
for the VR tests compared with an average increase of over four standard score points for the
NVR tests. In the current study the average changes were smaller but showed the same
pattern of greatest increase for NVR, with mean change scores of −0.6, +0.3 and +1.2 for VR,
QR and NVR respectively. Practice effects will not necessarily lower test-retest correlations.
However, if pupils with little experience of NVR item types make particular gains in the
second test, then this will lower the overall correlation, and there does appear to be evidence
of this (note 4). Greater practice effects may be one reason underlying the lower correlations for
NVR. However, we may also hypothesise that VR scores are particularly stable because
social and educational pressures emphasise verbal skills in particular, and such skills are
prerequisites for subsequent learning in a cumulative manner. As Anastasi (1976) states:
“Not only does the individual retain prior learning, but much of his prior learning provides
tools for subsequent learning. Hence the more progress he has made in the acquisition of
intellectual skills and knowledge at any one point in time, the better able he is to profit from
subsequent learning experiences.” (p328). In short, nothing succeeds like success. While this
is true of a range of skills, the pre-eminence of verbal skills for academic work may make
verbal reasoning particularly stable.
The correlations reported here are of course specific to the age range tested and it is not
assured that similar correlations would be found at other ages. Available data tends to suggest
that scores would be less stable at earlier ages, and this may be particularly pronounced for
NVR (Jensen, 1980; Vernon, 1979; Hopkins & Bracht, 1975; Sax, 1984). Care needs to be
exercised therefore in generalising from the current results, particularly to younger age
groups.
School effects
Some authors (e.g., Primrose, 2000) have suggested that changes in reasoning test scores
may be used as a direct means of assessing the ‘value-added’ by a school, given that the
provision of quality teaching along with a variety of learning opportunities should enhance
the cognitive skills of pupils. The data presented here suggests there were significant
differences between schools in pupils’ progress in the reasoning tests between age 10 and age
13. These school level differences may be statistically significant, but what is their
educational significance? What does an average difference of 3 standard score points on the
CAT represent in terms of some more widely known educational standard? There is a large
body of work relating CAT scores and GCSE examination results (e.g., Fernandes & Strand,
2000). From this we know, for example, that on average 55% of pupils with a mean CAT
score of 98 achieve 5 or more GCSE A*-C grades, while an average of 67% of pupils with a
mean CAT score of 101 achieve this level. These differences are not large, but neither are
they trivial. A school that can on average add three standard score points to CAT scores
between year 6 and year 9 may also be likely to increase pupils’ GCSE results by a
significant margin.
However, it is clear that school effects on pupils’ reasoning scores are small in relation to
their effects on pupils’ attainment. 3.7% of the variance in mean CAT score was at the school
level compared to around 7% of the variance in GCSE points score, and an even higher
proportion in individual GCSE subjects such as design & technology, history, geography,
English literature, mathematics, French and art. We should not be surprised that the school
exerts a particularly strong effect on outcomes such as public examination results, since these
are tests of what is directly taught in schools. For the same reason it is not surprising that the
secondary school exerts a particularly strong effect on pupils’ progress in those reasoning
skills most closely allied to the mathematics curriculum, with the school level accounting for
three times the proportion of variance in progress in QR compared to VR or NVR. This is
consistent with the general finding that mathematics is a subject most uniquely learned at
school (Brandsma & Knuver, 1989; Strand, 1998) while language skills are more a joint
operation of the school and the home/wider culture, and non-verbal skills seem minimally
influenced by school. These findings emphasise that reasoning tests are particularly well
placed to act as baseline assessments for secondary schools, since they take account of
important variation in school intakes which is largely, although not exclusively, outside of
the school’s control.
Finally, it is notable that schools accounted for only 2% of the variance in NVR score
compared to 3.7% of the variance in VR and QR scores. This finding supports Cattell’s
(1963) theory that NVR tasks are excellent measures of fluid-analytic abilities (gf) and are
less affected by acculturation, including formal schooling, than verbal and quantitative tasks.
VR and QR assess inductive and deductive reasoning, which Cattell would classify as fluid-
analytic abilities, but using acquired verbal and numerical concepts, and may be more
broadly defined as crystallised abilities (gc).
Notes
1. Pupils’ CAT2E answer sheets can be submitted for computer scoring or can be hand
marked. The results for all pupils in this study were computer scored. Compared to
human marking this minimises potential errors in the consistent and accurate application
of the marking key for each question, the totalling of raw scores for each test and
battery, the calculation of students’ age on the day of the test and the conversion from
raw scores to standard age scores.
2. Several previous studies have reported the percentage of pupils with change scores
within the range plus or minus 10 standard score points. For comparative purposes, the
relevant figures here are 86%, 81% and 79% for VR, QR and NVR respectively.
3. The residual is the difference between the predicted age 13 score based on age 10 score,
and the actual age 13 score. Residuals will be positive where the actual score is higher
than predicted and negative otherwise.
4. This interpretation is supported by a significant negative correlation between
‘volatility’, as indicated by absolute change in NVR score irrespective of sign, and initial
test score (r=-.11, p<.0001). Absolute amount of change tended to be greatest for those
pupils with low initial NVR scores. This was not seen for QR, and for VR there was
Practical and policy implications
Reasoning tests provide an excellent baseline assessment on entry to secondary school
because they are strongly related to pupils’ subsequent performance in national tests and
examinations, but relatively uninfluenced by school effects and prior educational
experiences.
The high correlation of reasoning scores with subsequent national tests and examinations
allows the calculation of robust indicators which can support schools in the process of
individual pupil target setting (Smith, Fernandes & Strand, 2001; Fernandes & Strand,
2000). Such indicators need to be interpreted carefully since reasoning ability is only one
of a large number of factors that influence exam performance, including motivation and
effort, opportunity, teaching quality, parental support and many others.
The high test-retest correlations suggest that a programme of annual testing is
unnecessary. However the extent of individual change suggests that it is important to
retest at times when key educational decisions are required. If one of the purposes of the
reasoning tests is to generate indicators to inform target setting, then schools should retest
pupils during year 9 (age 14) to ensure their indicators are up-to-date and relevant.
The relatively low correlation between VR and NVR (.65) indicates the importance of
using both a VR and a NVR test in the assessment of students’ reasoning ability. Verbal
tests, because of the emphasis which is placed on reading ability and familiarity with
language, can unduly influence the performance of pupils with English as an additional
language, pupils with poor primary school experience or those thought to have specific
difficulties with language based work (Elliott, 1990).
The relative stability of reasoning test scores has no direct bearing on the long running
debate over the relative influence of hereditary and environment on intelligence. The fact
that reasoning test scores are highly correlated over time may simply reflect systematic
and stable environmental influences, such as the socio-economic status of home.
actually more volatility among those with high initial scores (r=.12, p<.0001).
Conclusion
It is unfortunate that reasoning tests became so strongly associated with the period of
selective secondary education in the UK as this has left a negative perception of the tests with
many educators and others. However, antipathy to the tests themselves is essentially a bad
case of shooting the messenger. We should remember that the tests were serving a political
purpose, and that, to support their use for selection, claims were made for the tests that were
heavily contested (e.g., Kamin, 1974): for example, that reasoning tests measured ‘innate’
ability, that pupils’ performance was not amenable to coaching or instruction, and that the
scores were unchanging, indicating an invariant capacity of the pupil. Modern interpretations
view reasoning tests as reflecting the pupils’ experiences up to the time of testing rather than
providing an indication of fixed potential (Whetton, 1995). The current study serves to
emphasise the degree of individual pupil change that can underlie even very high correlations.
When used appropriately, reasoning tests have a positive and valuable role to play in
education.
REFERENCES
Anastasi, A. (1976). Psychological testing (fourth edition). New York: Macmillan.
Brandsma, H. P., & Knuver, J. (1989). Effects of school classroom characteristics on pupil
progress in language and arithmetic. International Journal of Educational Research,
13, (7), 777-788.
Cattell, R. B. (1963). Theory of fluid and crystallised intelligence: a critical experiment.
Journal of Educational Psychology, 54, 1-22.
Elliott, C. D. (1990). Nonverbal tests. In: Walberg, H. J. & Haertel, G. D. (Eds).
International Encyclopaedia of Educational Evaluation. Oxford: Pergamon Press.
Fernandes, C. & Strand, S. (2000). Cognitive Abilities Test and KS3/GCSE indicators:
Technical Report. Windsor: nferNelson.
Hague, N., Smith, P. & Courtney (1993). NFER verbal reasoning test series. Windsor:
nferNelson.
Hindley, C. B., & Owen, C. F. (1978). The extent of individual changes in IQ for ages
between 6 months and 17 years, in a British longitudinal sample. Journal of Child
Psychology and Psychiatry, 19, 329-350.
Hopkins, K. D., & Bracht, G. H. (1975). Ten year stability of verbal and nonverbal IQ
scores. American Educational Research Journal, 12, 467-477.
Jensen, A. R. (1980). Bias in mental testing. London: Methuen.
Kamin, L. J. (1974). The science and politics of IQ. Aylesbury: Penguin.
Moffitt, T. E., Caspi, A., Harkness, A. R., & Silva, P. A. (1993). The natural history of
change in intellectual performance: Who changes? How much? Is it meaningful?
Journal of Child Psychology and Psychiatry, 34, (4), 455-506.
Pinneau, S. R. (1961). Changes in Intelligence Quotient. Boston: Houghton Mifflin.
Primrose, A. F. (2000). Verbal reasoning test scores and their stability over time.
Educational Research, 42, (2) 167-174.
Rees, A. H., & Palmer, F. H. (1970). Factors related to change in mental test performance.
Psychological Monographs, 3, 1-57.
Sax, G. (1984). The Lorge-Thorndike Intelligence Tests / Cognitive Abilities Test. In:
Keyser, D. J., & Sweetland, R. C. (Eds) Test Critiques, Volume 1. Kansas City: Test
Corporation of America.
Smith, P. & Hague, N. (1993). NFER non verbal reasoning test series. Windsor: nferNelson.
Smith, P., Fernandes, C., & Strand, S. (2001). Cognitive Abilities Test Third Edition:
Technical Manual. Windsor: nferNelson.
Strand, S. (1998). A value added analysis of the 1996 primary school performance tables.
Educational Research, 40, (2), 123-137.
Strand, S. (2000). Cognitive Abilities Test and Key Stage 2 Indicators: Technical Report.
Windsor: nferNelson.
Thorndike, R. L., Hagen, E., & France, N. (1986). Cognitive Abilities Test second edition:
Administration manual. Windsor: nferNelson.
Thorndike, R. L. (1971). Educational Measurement. Second Edition. Washington: American
Council on Education, p. 379.
Vernon, P. E. (1961). The structure of human abilities, second edition. London: Methuen.
Vernon, P. E. (1979). Intelligence: Heredity and Environment. San Francisco: Freeman &
Co.
Whetton, C. (1995). Verbal reasoning tests. In: Husen, T., & Postlethwaite, N. (Eds)
International Encyclopaedia of Education. Oxford: Pergamon Press.
TABLE 1: Pearson correlation coefficients between VR, QR, NVR and mean CAT score(a)

                                        Year 9 CAT scores (age 13+)
Year 6 CAT scores (age 10+)    Verbal       Quantitative   Nonverbal    Mean CAT
                               Reasoning    Reasoning      Reasoning    score
Verbal Reasoning               .87          .72            .65          .82
Quantitative Reasoning         .69          .79            .70          .79
Nonverbal Reasoning            .64          .66            .76          .76
Mean CAT score                 .81          .79            .75          .89
Notes: (a) Listwise deletion, n=10,644. All correlations significant at p<.0001. The bold figures in the
diagonal show the test-retest correlations. The shaded cells in the upper right quadrant show the correlations
between the batteries at age 10, while the unshaded cells in the lower left quadrant show the correlations
between the batteries at age 13.
TABLE 2: Distribution of change scores for each CAT battery and mean CAT score

Score          Verbal       Quantitative   Nonverbal    Mean CAT
Difference     Reasoning    Reasoning      Reasoning    score
               %            %              %            %
<-23           0            0              0            0
-22 to -18     1            1              1            0
-17 to -13     3            4              4            1
-12 to -8      11           12             10           7
-7 to -3       24           21             19           24
-2 to +2       30           26             24           37
+3 to +7       19           19             20           23
+8 to +12      8            11             13           7
+13 to +17     3            4              6            1
+18 to +22     1            1              3            0
>23            0            0              1            0
Note: Figures are rounded to whole numbers and may not sum to exactly 100%.
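As a quick arithmetic check on Table 2, the share of pupils whose score changed by eight or more standard score points can be obtained by summing the outer bands. The short Python sketch below does this directly from the tabulated percentages. The list layout and the function name `share_changing_by` are illustrative choices rather than part of the original analysis, and because the source figures are rounded the totals are approximate.

```python
# Percentages transcribed from Table 2. Each row is
# (band label, lower bound on the absolute change in the band, VR, QR, NVR, Mean CAT).
TABLE_2 = [
    ("<-23",       23, 0, 0, 0, 0),
    ("-22 to -18", 18, 1, 1, 1, 0),
    ("-17 to -13", 13, 3, 4, 4, 1),
    ("-12 to -8",   8, 11, 12, 10, 7),
    ("-7 to -3",    3, 24, 21, 19, 24),
    ("-2 to +2",    0, 30, 26, 24, 37),
    ("+3 to +7",    3, 19, 19, 20, 23),
    ("+8 to +12",   8, 8, 11, 13, 7),
    ("+13 to +17", 13, 3, 4, 6, 1),
    ("+18 to +22", 18, 1, 1, 3, 0),
    (">23",        23, 0, 0, 1, 0),
]
COLUMNS = ("VR", "QR", "NVR", "Mean CAT")


def share_changing_by(threshold):
    """Sum the bands in which every change is at least `threshold` points in size.

    `threshold` should sit on a band edge (3, 8, 13, 18 or 23), because the
    table only reports percentages per band.
    """
    totals = dict.fromkeys(COLUMNS, 0)
    for _label, min_abs_change, *percentages in TABLE_2:
        if min_abs_change >= threshold:
            for column, pct in zip(COLUMNS, percentages):
                totals[column] += pct
    return totals


print(share_changing_by(8))
# {'VR': 27, 'QR': 33, 'NVR': 38, 'Mean CAT': 16}
```

On these rounded figures, changes of eight or more points affect about 27% of pupils on VR, 33% on QR, 38% on NVR and 16% on the mean CAT score.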
TABLE 3: 90% confidence bands for CAT batteries taking into account regression to the mean.

90% Confidence Bands (CB) for Obtained Scores

                 Verbal Reasoning      Quantitative Reasoning   Non-Verbal Reasoning
Obtained Score   Lower CB   Upper CB   Lower CB   Upper CB      Lower CB   Upper CB
69-75            -3         11         -1         13            0          14
76-82            -4         10         -3         11            -2         12
83-89            -5         9          -4         10            -4         10
90-96            -6         8          -6         8             -5         9
97-103           -7         7          -7         7             -7         7
104-110          -8         6          -8         6             -9         5
111-117          -9         5          -10        4             -10        4
118-124          -10        4          -11        3             -12        2
125-131          -11        3          -13        1             -14        0
Notes: These confidence bands assume a SEM of 4.0, use the test-retest correlations presented in Table 1 and
are centred around ‘true’ scores. The formula for estimating true scores was $T = \bar{X} + r_{12}(X - \bar{X})$, where: $T$ =
estimated true score; $\bar{X}$ = population mean; $r_{12}$ = test-retest correlation; and $X$ = obtained score (see Thorndike,
1971, p. 379). This table is a simplification of more detailed data applied automatically to results processed by
the CAT computer scoring service.
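The true-score formula in the notes to Table 3 can be worked through numerically. The sketch below is a minimal illustration in Python, not the scoring service's actual procedure. It assumes a population mean of 100 on the standard age score scale, evaluates each score band at its midpoint, and uses a half-width of 7 points (1.645 × 4.0 = 6.58, taken here as rounded up to 7). All three are assumptions made for this illustration, and the function names are hypothetical.

```python
POPULATION_MEAN = 100.0  # assumed mean of the standard age score scale
HALF_WIDTH = 7.0         # assumed: 1.645 * SEM of 4.0 = 6.58, rounded up to 7

# Test-retest correlations from the diagonal of Table 1.
TEST_RETEST_R = {"VR": 0.87, "QR": 0.79, "NVR": 0.76}


def estimated_true_score(obtained, r):
    """Regress the obtained score towards the population mean."""
    return POPULATION_MEAN + r * (obtained - POPULATION_MEAN)


def confidence_band(obtained, r):
    """Return (lower, upper) limits as offsets from the obtained score."""
    shift = estimated_true_score(obtained, r) - obtained
    return round(shift - HALF_WIDTH), round(shift + HALF_WIDTH)


if __name__ == "__main__":
    # Midpoints of the score bands 69-75, 76-82, ..., 125-131.
    for midpoint in range(72, 129, 7):
        bands = {name: confidence_band(midpoint, r) for name, r in TEST_RETEST_R.items()}
        print(midpoint, bands)
```

Under these assumptions, an obtained VR score of 72 gives an estimated true score of 75.64 and a band of -3 to +11, which agrees with the first row of Table 3. The band is asymmetric because it is centred on the estimated true score rather than on the obtained score.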
TABLE 4: Variance components models for CAT scores at age 13 and GCSE score

Hierarchical   GCSE total     VR       QR       NVR      Mean CAT
level          points score                              score
School         6.8%           3.7%     3.8%     2.0%     3.7%
Year           0.3%           0.2%     0.0%     0.2%     0.0%
Pupil          93.0%          96.1%    96.2%    97.8%    97.6%
Notes: The MLwiN software was used to model the data. Pupils from Pupil Referral Units and special schools,
and pupils with missing scores on any of the CAT batteries at either age, were excluded from the analysis.
TABLE 5: Summary of value-added analyses

Measure                                                          VR      QR      NVR
Variance at school level after adjusting for initial score(1)    1.1%    3.4%    1.1%
(a) average for the five schools with the highest residuals      1.3     2.0     1.3
(b) average for the five schools with the lowest residuals       -1.1    -2.3    -1.6
(a)–(b) (estimate of school effect)                              2.4     4.3     2.9
Number of schools with statistically significant effects(1)      5       14      5
Notes:
(1) Derived from multi-level models with random intercepts and fixed slopes at school, year and pupil levels.
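The residual-based rows of Table 5 can be illustrated with a simplified calculation. The Python sketch below is not the multi-level model used in the study: it substitutes a single-level ordinary least squares regression of age 13 score on age 10 score, takes each pupil's residual as actual minus predicted score, averages the residuals within each school, and then compares the highest and lowest school averages. The function names, the record layout and the four-pupil data set are all hypothetical and exist only to show the arithmetic.

```python
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def school_mean_residuals(records):
    """records: iterable of (school, age10_score, age13_score) tuples."""
    records = list(records)
    intercept, slope = fit_line([r[1] for r in records], [r[2] for r in records])
    by_school = defaultdict(list)
    for school, age10, age13 in records:
        predicted = intercept + slope * age10
        by_school[school].append(age13 - predicted)  # actual minus predicted
    return {school: mean(values) for school, values in by_school.items()}


def school_effect_range(mean_residuals, top_n=5):
    """Average of the top_n highest school means minus the top_n lowest."""
    ranked = sorted(mean_residuals.values())
    return mean(ranked[-top_n:]) - mean(ranked[:top_n])


# Invented toy data: two schools, two pupils each.
toy = [("A", 90, 95), ("A", 110, 113), ("B", 90, 89), ("B", 110, 107)]
residuals = school_mean_residuals(toy)
print(residuals)
print(school_effect_range(residuals, top_n=1))
```

In the toy data, school A's pupils score about 3 points above prediction and school B's about 3 points below, so the range is about 6 points. Table 5 reports the analogous quantity, using five schools at each extreme, as 2.4, 4.3 and 2.9 points for VR, QR and NVR.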
FIGURE CAPTIONS
Figure 1: Predicted retest score from a regression on initial test score for each CAT battery
FOOTNOTES