
Correlation and consistency in reasoning test

scores over time

Dr Steve Strand

Paper presented at the Annual Conference of the British Educational Research Association, University of Exeter, England, 12-14 September 2002

Address for Correspondence

Dr Steve Strand, Senior Assessment Consultant, nferNelson

The Chiswick Centre, 414 Chiswick High Road

LONDON W4 5TF

Tel: (020) 8996 8414

e-mail: [email protected]

Keywords: Reasoning ability, correlation, stability, change, consistency, school effects, value added


ABSTRACT

UK schools have a long history of using reasoning tests, most frequently of Verbal

Reasoning (VR), Non Verbal Reasoning (NVR), and to a lesser extent Quantitative

Reasoning (QR). Results are used to identify students’ learning needs, for grouping students,

for identifying underachievement, and for providing indicators of future academic

performance (Fernandes & Strand, 2000). Despite this widespread use there is little data on

the long term consistency of VR, QR and NVR as discrete abilities. This study compares the

performance of over 10,000 pupils who completed the Cognitive Abilities Test Second

Edition (CAT2E) in year 6 (age 10+) and year 9 (age 13+) and GCSE public examinations

in year 11 (age 15+). The results reveal very high correlations in scores over time, ranging

from .87 for VR to .76 for NVR, but also show around one-quarter of pupils on the VR test

and one-third of pupils on the QR and NVR tests changed their scores by eight or more

standard score points. Schools accounted for only a small part of the total variation in

reasoning scores, although they accounted for a much greater proportion of the variation in

measures of attainment such as GCSE. School effects on pupils’ progress in the reasoning tests

between age 10 and age 13 were relatively modest. Some practical and policy implications

for schools are discussed.


INTRODUCTION

UK schools have a long history of using tests to assess students’ reasoning ability. Reasoning

tests attempt to assess the perceiving of relationships among abstract elements and symbols.

They differ from attainment tests in that the test material is not intended to be a sampling of

what is taught to any particular age group in school, rather it attempts to minimise the effect

of specific curricular experience. The tests use general, not specialised, knowledge that

individuals in a particular age group could have acquired from a broad variety of experiences

in or out of school. The basic test elements are kept relatively simple, clear and familiar with

the intended emphasis on discovery of relationships and flexibility of thinking. The most

frequently used tests are of Verbal Reasoning (VR), Non Verbal Reasoning (NVR), and to a

lesser extent Quantitative Reasoning (QR). Verbal tests use symbols representing words, QR

tests use symbols representing numbers or quantities, while NVR tests use symbols

representing spatial, geometric or figural patterns.

Prior to the mid-1970s reasoning tests were a formal part of the process of selection to

secondary school at age 11 in most areas of the UK, although this practice has largely

disappeared with comprehensive education. However reasoning tests remain in widespread

use for the diagnosis of learning needs, for grouping students, for identifying

underachievement, and for providing indicators of future academic performance (Strand,

2000; Fernandes & Strand, 2000).


Despite their widespread use, there is relatively little empirical data on the consistency over

time of group tests of VR, QR and NVR as discrete reasoning abilities. Much has been

written about the stability or otherwise of full scale IQ scores. Early studies are well

reviewed by Pinneau (1961), Anastasi (1976) and Vernon (1979). As would be expected,

test-retest correlations are higher the shorter the interval between tests, and tend to be higher

for older children. Vernon concludes that typical correlations from 6 to 10 years, or from 10

to 17 years, are approximately .70 (Vernon, 1979, p75). More recent studies (e.g., Hindley &

Owen, 1978; Moffitt, Caspi, Harkness & Silva, 1993) have produced even higher

correlations. For example, Moffitt et al. tested nearly 1,000 pupils on the Wechsler

Intelligence Scale for Children - Revised (WISC-R) at age 7, 9, 11 and 13, and report

correlation coefficients ranging from .74 between age 7 and 13 up to .85 between age 9 and

11 and between age 11 and 13.

Full scale IQ tests, such as the Wechsler and Stanford-Binet, cover a wide range of item

types, sometimes require oral responses from the testee, necessitate the manipulation of

material and are administered on a one-to-one rather than a group basis. Similar levels of

consistency are not assured for group administered reasoning tests. Given the widespread use

of group tests within educational settings, data on their long term consistency is needed.

Additionally, factorial studies in which a variety of ability factors are distinguished (Vernon,

1961), and the distinction between “fluid” and “crystallised” intelligence proposed by Cattell

(1963), suggest it is particularly desirable to have information separately on VR, QR and

NVR reasoning tests. Some studies (Hopkins & Bracht, 1975) have indicated higher

consistency for VR rather than NVR test scores. This may have important practical as well as

theoretical implications.


The current study looks in particular at the consistency of scores on the Cognitive Abilities

Test Second Edition (CAT2E) between the ages of 10 and 13 to shed light on the following

questions:

What is the correlation between pupils’ reasoning test scores at age 10 and age 13? Are

there differences between verbal, quantitative and non-verbal reasoning tests in their

consistency over time?

What is the extent of change in individual pupils’ scores over time? How should this

variation be interpreted?

What is the influence of the school on pupils’ performance in reasoning tests? Can the

results be used to assess the ‘value-added’ by the school?

What are the implications for the interpretation and use of reasoning test scores?

What are the implications for educational practice and policy?

METHOD

The dataset

Data was collected over a three year period from a large Local Education Authority (LEA) in

the South East of England. The CAT2E is used with all pupils in the LEA on two occasions:

Level C is completed in October of year 6 at primary school (average pupil age 10:6) and

Level F in October of year 9 at secondary school (average pupil age 13:6). Both sets of

CAT2E scores, along with subsequent year 11 GCSE examination results, were collected on

three separate cohorts of pupils who completed their GCSE examinations in summer 1998,

1999 and 2000 respectively. A total of 10,644 pupils were included in the study. For some

analyses pupils attending Pupil Referral Units or special schools were excluded, as were

pupils with a missing score on any of the three CAT batteries at each age, leaving a core


sample of 10,621 pupils attending 25 mainstream secondary schools. The average year group

within schools comprised 144 pupils, with two-thirds of year groups in the range 101 to 187

pupils.

The Cognitive Abilities Test Second Edition (CAT2E)

The Cognitive Abilities Test Second Edition (CAT2E) (Thorndike, Hagen & France, 1986)

provides an assessment of reasoning skills with separate Verbal Reasoning (VR),

Quantitative Reasoning (QR) and Non Verbal Reasoning (NVR) batteries. The pupil’s mean

score over the three batteries (mean CAT score) is also calculated. The test is divided into six

levels A-F, and is UK standardised across the age range 7:6 to 15:9. The CAT is the most

widely used test of reasoning ability in the UK, with 800,000 students assessed during the

academic year 1999/2000.

The Verbal Reasoning (VR) battery consists of four tests:

1. Vocabulary – choosing a synonym from a list of five possibilities (e.g., wish : agree,

bone, over, want, waste)

2. Sentence Completion – selecting one word from a list of five (e.g., John likes to ____

a football match: eat, help, watch, read, talk)

3. Verbal Classification – given three or four words belonging to one class, select which

further word from a list of five belongs to the same class (e.g. eye, ear, mouth : nose,

smell, head, boy, speak)

4. Verbal Analogies – given one pair of words, complete a second pair from five

possibilities (e.g., big->large; little->? : boy, small, late, lively, more)

The Quantitative Reasoning (QR) battery consists of three tests:


1. Quantitative Relations: Given two quantities decide whether one is greater, equal to

or less than the other (e.g., ¼ vs. ½ );

2. Number Series: select one from five possible choices to complete the series (e.g.,

2,4,6,8,?)

3. Equation Building: utilise the given elements to create a true equation (e.g., 2 2 3 +

x : 6, 8, 9, 10, 11)

The Non Verbal Reasoning (NVR) battery consists of three tests:

1. Figure Classification: given three or four shapes belonging to one class, select which

further shape from a list of five belongs to the same class;

2. Figure Analogies: given one pair of shapes, complete a second pair from five

possibilities

3. Figure Synthesis : Given different shaped ‘pieces’, decide whether target figures can

or cannot be constructed from them.

The reliability of the CAT2E

Any interpretation of long term consistency in CAT2E scores has to be considered in the

context of the reliability of the test. Variation in test scores over time will be due to a variety

of factors. At least some of these factors will be transient, including variation in pupils’

motivation or affective state at the time of taking the test/s, possible variations in test

administration, errors in marking, etc. (see note 1). For this reason, published tests should provide clear

measures of their reliability. Such reliability data for the CAT2E comes from a number of

sources. Internal consistency estimates for the CAT2E are high, averaging 0.94, 0.91 and

0.93 for VR, QR and NVR batteries respectively (Thorndike et al., 1986). Six month test-

retest correlations are also available from the US version (CogAT, Form 3) for each battery

at each test level (reported by Sax, 1984). Correlations for VR range across levels from .85 to


.93, for QR from .78 to .88, and for NVR from .81 to .89. In all instances the lower values

are for the youngest (age 8) pupils. These figures indicate that the CAT2E is highly reliable,

and effectively set a ceiling on the longer term correlations we might expect from the three

year test-retest data reported in this study.
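
The sense in which reliability sets a ceiling can be written down explicitly. Under classical test theory (an assumption introduced here, not a derivation given in the source), the correlation between two observed scores cannot exceed the geometric mean of their reliabilities:

```latex
r_{xy} \le \sqrt{r_{xx}\, r_{yy}}
```

Here r_xy is the observed test-retest correlation, and r_xx and r_yy are the reliabilities of the test on the first and second occasions; these symbols are illustrative notation.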

RESULTS

Consistency over time

Table 1 presents the correlations between the different batteries at each age and the

correlations between batteries over time. The diagonal figures in bold show the three year

test-retest correlations. These reveal correlations of .87 for verbal reasoning, .79 for

quantitative reasoning, .76 for non-verbal reasoning and .89 for mean CAT score. Given the

three year time period between the tests, which includes the period of transfer between

primary and secondary school, these correlations are remarkably high. It is notable that the

correlation for VR is substantially higher than for either QR or NVR. These correlations

indicate substantial consistency in pupils’ reasoning scores over the three year period.
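
As a minimal sketch of how a test-retest correlation of this kind is computed, the Python function below calculates the Pearson coefficient for paired scores. The function name and the five pairs of scores are hypothetical and are not data from the study.

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation between paired scores from two test occasions."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two pupils")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    return sxy / sqrt(sxx * syy)

# Hypothetical standard age scores for five pupils at age 10 and age 13.
age10 = [85, 92, 100, 108, 121]
age13 = [88, 90, 103, 105, 118]
r = pearson_r(age10, age13)
```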

<------------------------------------------>

INSERT TABLE 1 ABOUT HERE

<------------------------------------------>

The intercorrelations between the three CAT batteries are shown off the diagonal in Table 1,

with the upper right cells showing the intercorrelations at age 13 and lower left cells showing

the intercorrelations at age 10. These range from .64 between VR and NVR at age 13, to .72

between VR and QR at age 10. While these correlations are highly significant, they are

sufficiently low to suggest real differences in pupils’ performance in the verbal, quantitative

and non-verbal tests.


There was no significant difference in the test-retest correlations of boys and girls, which

were identical for VR and differed only slightly for QR (.80 vs. .78) and NVR (.75 vs. .77)

for boys and girls respectively.

Individual variability

A high correlation does not mean that pupils’ scores are constant and unchanging, indeed

high correlation coefficients can conceal important change within individuals over time.

Change scores, the difference between a pupil’s standardised score on the retest and their

initial standardised score, were calculated for each battery. Table 2 indicates the proportion

of pupils with a change score in specified ranges.

<------------------------------------------>

INSERT TABLE 2 ABOUT HERE

<------------------------------------------>

The extent of these changes needs to be considered in relation to the standard error of

measurement (SEM) of the test. The SEM for each CAT battery is estimated to be four

standard score points (Thorndike et al., 1986, p49). Given this figure, the 90% confidence

band for any pupil’s test score is ±1.645 × SEM, or plus or minus seven

standard score points. Change scores within this range represent fluctuations to be expected

from measurement error. The percentage of pupils with a retest score within 7 points of their

initial score is 72% for VR, 66% for QR and 63% for NVR (see note 2). While this indicates

substantial consistency, it also shows a significant change in score for approximately one-

quarter of pupils on the VR test and one-third of pupils on the QR and NVR tests.
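
The arithmetic behind the confidence band, and the tally of pupils falling inside it, can be sketched as follows. The SEM of 4 and the multiplier 1.645 come from the text; the function names and the four pairs of scores are hypothetical.

```python
SEM = 4.0     # standard error of measurement quoted in the text
Z_90 = 1.645  # normal multiplier for a 90% band, as used in the text

def confidence_half_width(sem=SEM, z=Z_90):
    """Half-width of the confidence band in standard score points."""
    return z * sem

def share_within_band(initial, retest, band=7):
    """Proportion of pupils whose change score lies within +/- band points."""
    changes = [b - a for a, b in zip(initial, retest)]
    return sum(abs(c) <= band for c in changes) / len(changes)

half_width = confidence_half_width()  # 6.58, which the text rounds to 7

# Hypothetical scores: change scores are +3, -8, +7 and +10.
share = share_within_band([100, 100, 100, 100], [103, 92, 107, 110])
```

In this toy example two of the four change scores fall within the band, so `share` is 0.5.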


The change scores for mean CAT score are also given in Table 2. These are very low with

84% of retest scores within plus or minus 7 points of initial score, and 95% within plus or

minus 10 points. However to some extent the change scores are reduced because the mean

score across batteries necessarily has a smaller standard deviation (SD) than the individual

batteries. Thus at age 13 the SD of mean CAT score was only 11.5, compared to 13.6, 12.8

and 12.5 for the VR, QR and NVR tests respectively.
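
Why the mean score has a smaller SD can be made explicit. Since the mean CAT score is the average of the three battery scores, its variance follows from the battery SDs and their intercorrelations. In the sketch below, σ denotes a battery's SD and ρ the correlation between two batteries; this notation is introduced here for illustration.

```latex
\sigma^2_{\text{mean}} = \frac{1}{9}\left(
  \sigma^2_{VR} + \sigma^2_{QR} + \sigma^2_{NVR}
  + 2\rho_{VR,QR}\,\sigma_{VR}\sigma_{QR}
  + 2\rho_{VR,NVR}\,\sigma_{VR}\sigma_{NVR}
  + 2\rho_{QR,NVR}\,\sigma_{QR}\sigma_{NVR}
\right)
```

Because each intercorrelation is below 1, the resulting SD is smaller than the average of the three battery SDs.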

Regression to the mean

When interpreting simple change scores between tests taken on two occasions, it is important

to consider the influence of regression to the mean. Regression to the mean is a statistical

artefact reflecting the fact that, since the correlation between scores at any two ages is always

less than perfect, the second test score will tend to be closer to the population mean than the

first test score. This is illustrated in Figure 1 which shows the effect directly by regressing

second test score on initial test score for VR, QR and NVR respectively. We can see that

regression to the mean is: (i) greatest where initial scores are towards the extremes, and; (ii)

greater for NVR, QR and VR tests in that order, reflecting their lower test-retest correlations

respectively. As a result of regression to the mean, change scores tend to be positive for low

initial scores and negative for high initial scores.
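
The size of the effect can be illustrated with a simple linear sketch. It assumes a population mean of 100 and equal score SDs on both occasions, so that the expected retest score is the mean plus the test-retest correlation times the initial deviation from the mean. These assumptions are made here for illustration; the regression lines in Figure 1 are fitted to the observed data and may differ.

```python
POPULATION_MEAN = 100.0
# Three year test-retest correlations reported in the text.
RETEST_R = {"VR": 0.87, "QR": 0.79, "NVR": 0.76}

def expected_retest(initial, r, mean=POPULATION_MEAN):
    """Expected retest score under the equal-SD linear assumption."""
    return mean + r * (initial - mean)

for battery, r in RETEST_R.items():
    low = expected_retest(70, r)
    high = expected_retest(130, r)
    print(f"{battery}: initial 70 -> {low:.1f}, initial 130 -> {high:.1f}")
# VR: initial 70 -> 73.9, initial 130 -> 126.1
# QR: initial 70 -> 76.3, initial 130 -> 123.7
# NVR: initial 70 -> 77.2, initial 130 -> 122.8
```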

<------------------------------------------>

INSERT FIGURE 1 ABOUT HERE

<------------------------------------------>

What are the implications of regression to the mean? First, the confidence bands for a pupil’s

score will not be symmetrical over the whole score range but will tend to be skewed at the

extremes. Table 3 presents confidence bands calculated across a range of scores for each

CAT2E battery. It can be seen that only around the mean score of 100 are the confidence


bands symmetrical at +7 and –7 standard score points respectively. As scores move away

from the mean the bands become asymmetrical so that a greater part of the confidence band

lies in the direction of the average score for the test. It can also be seen that this effect is most

pronounced for the NVR test and least pronounced for the VR test. Practitioners wishing to

evaluate whether an individual pupil’s change score is significant, should use the specific

values in this table rather than applying the overall confidence band.

<------------------------------------------>

INSERT TABLE 3 ABOUT HERE

<------------------------------------------>

Second, and more importantly, a consequence of regression to the mean is that any group of

low scorers can be expected to show a gain in score on a second occasion without any

intervention. Equally a group of high scorers can be expected to show a drop in score on a

second occasion. Practitioners must therefore be cautious in their interpretation of simple

change scores for groups of pupils. To attribute any change to an educational intervention,

such as a thinking skills or literacy training programme, will require sophisticated statistics

such as analysis of covariance with initial test score as a covariate, or comparisons with a

matched ‘control’ group who do not receive the intervention.
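
A small simulation makes the point concrete. The sketch below draws pairs of scores from a bivariate normal distribution with mean 100, SD 15 and a chosen test-retest correlation, applies no intervention of any kind, and reports the mean change for pupils whose initial score was 85 or below. The distributional model, cutoff, sample size and seed are all assumptions made for illustration; none of the values are data from the study.

```python
import random
from math import sqrt

def simulate_low_scorer_gain(r=0.76, n=20000, cutoff=85, seed=1):
    """Mean change score for low initial scorers when nothing is done to them."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        first = 100 + 15 * z1
        # Second score correlates r with the first and has the same mean and SD.
        second = 100 + 15 * (r * z1 + sqrt(1 - r * r) * z2)
        if first <= cutoff:
            changes.append(second - first)
    return sum(changes) / len(changes)

mean_gain = simulate_low_scorer_gain()
```

With these settings the low-scoring group shows a positive average gain although no intervention was simulated, which is why a simple change score cannot by itself be credited to a programme.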

School effects

Multi-level regression models were computed to examine the degree of variance at each level

within the three hierarchical levels that describe the data, i.e., pupils (level 1) grouped within

year cohorts (level 2) grouped within schools (level 3). Separate models were run for each of

the age 13 CAT2E scores and for the same pupils’ GCSE public examination results. The

first set of models includes only the intercept terms at each level to allow a variance

components analysis, as shown in Table 4. There are two key findings.


<------------------------------------------>

INSERT TABLE 4 ABOUT HERE

<------------------------------------------>

First, the variance attributable to the school level varies significantly across the three CAT

batteries. The school level variance for VR and QR (around 3.7%) is almost twice the school

level variance for NVR (around 2%). This might be expected since, in contrast to the NVR

test, both the verbal and quantitative reasoning tests require some familiarisation with basic

literacy and numeracy concepts and may therefore be influenced to a greater degree by the

school and curriculum exposure.

Second, the school level variance in reasoning test scores, even for VR and QR, is small

compared to the school level variance for measures of pupil attainment. For GCSE total

points score, the proportion of variance accounted for at the school level was almost 7%,

nearly twice the amount for VR and QR, and over three times the amount for NVR. For

individual GCSE subjects, the following proportions of variance at the school level were

found: English (6%), Science (6%), Design & Technology (8%), History (8%), Geography

(9%), English Literature (10%), Mathematics (11%), French (11%) and Art (16%). Clearly

schools account for a much smaller proportion of the variance in reasoning test score than

they do for attainment tests such as public examinations.
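
To illustrate what a "proportion of variance at the school level" means, the sketch below estimates it with a one-way random-effects ANOVA for equal-sized schools. This is a deliberately simplified two-level stand-in, not the three-level multi-level model used in the study, and the function name and scores are hypothetical.

```python
from statistics import mean

def school_variance_share(scores_by_school):
    """Share of total variance between schools (balanced one-way ANOVA estimator)."""
    groups = list(scores_by_school.values())
    k = len(groups)      # number of schools
    n = len(groups[0])   # pupils per school
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal-sized schools")
    grand = mean(x for g in groups for x in g)
    ms_between = n * sum((mean(g) - grand) ** 2 for g in groups) / (k - 1)
    ms_within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (k * (n - 1))
    # Negative estimates are truncated at zero.
    var_between = max((ms_between - ms_within) / n, 0.0)
    return var_between / (var_between + ms_within)

# Hypothetical scores for two schools of three pupils each.
share = school_variance_share({"A": [98, 100, 102], "B": [100, 102, 104]})
```

A share of 0.037 from such an estimator would correspond to the 3.7% quoted for VR and QR, with the remainder lying between pupils within schools.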

A further set of ‘value added’ analyses was completed by including initial test score in the

above models. We can then determine whether there was any significant variation between

secondary schools in pupils’ progress in the reasoning tests between age 10 and 13. There

were statistically significant associations between secondary school and pupil progress for all

three batteries, although the magnitude of the effect varied across batteries. A conservative


estimate of the school effect can be generated by comparing the average residual for the five

schools with the highest residuals (see note 3) against the average for the five schools with the lowest

residuals (see Table 5). The difference in pupil progress between the most and least

‘effective’ schools in these terms was around 2.4 and 2.9 standard score points for VR and

NVR respectively. However for both tests only five of the 25 schools could be reliably

distinguished from the others in terms of pupil progress (three schools with significantly

above average and two schools with significantly below average progress). The picture was

markedly different for QR. Here the ‘school effect’ was much greater, with a difference of

4.3 standard score points between the five most and five least ‘effective’ schools. A total of

14 schools had significant residuals, seven schools recording significantly above average and

seven significantly below average progress.
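
The idea of a school-level residual can be sketched with an ordinary single-level regression: predict each pupil's age 13 score from their age 10 score, take actual minus predicted, and average within each school. This is a simplification of the multi-level models used in the study, and the function name and records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def school_mean_residuals(records):
    """records: iterable of (school_id, age10_score, age13_score) tuples."""
    records = list(records)
    xs = [r[1] for r in records]
    ys = [r[2] for r in records]
    x_bar, y_bar = mean(xs), mean(ys)
    # Least-squares slope and intercept for age 13 score on age 10 score.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    by_school = defaultdict(list)
    for school, x, y in records:
        # Residual is positive where the actual score exceeds the prediction.
        by_school[school].append(y - (intercept + slope * x))
    return {school: mean(res) for school, res in by_school.items()}

# Hypothetical pupils: school A scores above prediction, school B below.
residuals = school_mean_residuals([
    ("A", 90, 93), ("A", 110, 113),
    ("B", 90, 87), ("B", 110, 107),
])
```

In this toy example school A has a mean residual of +3 and school B of -3 standard score points.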

<------------------------------------------>

INSERT TABLE 5 ABOUT HERE

<------------------------------------------>

Correlations between the school level residuals for each test were calculated to determine

how consistent schools were in their effects across the different CAT batteries. While there

was a significant correlation between the school residuals for VR and QR (r=.43, p<.03)

there was no significant correlation between the residuals for VR and NVR (r=.35) or QR

and NVR (r=.29). Some schools were more effective than others in promoting academic

reasoning as indexed by the verbal and quantitative tests, but these were not necessarily the

schools with significant changes in NVR score.


DISCUSSION

The main findings of the study can be summarised as follows:

The correlation between mean CAT score at 10:6 and 13:6 was .89, indicating a high

degree of consistency in pupils’ reasoning test scores over the three year period. The

correlation was particularly strong for VR at .87, but was also high for QR and NVR

at .79 and .76 respectively.

High levels of correlation do not mean there is no change in pupils’ scores over time.

Even for the VR test around one-quarter of pupils recorded a change of eight or more

standard score points. This rose to around one-third of pupils for the QR and NVR tests.

Because of regression to the mean, simple change scores should not be used to evaluate

the outcome of any educational intervention. Sophisticated statistics or the use of control

groups are required for such purposes.

The intercorrelations between the three CAT batteries ranged from .64 between VR and

NVR to .72 between VR and QR. While these correlations are highly significant, they are

sufficiently low to suggest real differences in pupils’ reasoning skills in the verbal,

quantitative and non-verbal domains.

There were significant differences between schools in pupils’ progress on the reasoning

tests between age 10 and 13. The variation between the most and least ‘effective’ schools

was around two standard score points for VR, three points for NVR and four points for

QR. Some schools were more effective than others in promoting progress in academic

reasoning as indexed by increases in verbal and quantitative tests, but these were not

necessarily the schools with significant changes in NVR score.


Despite the above, school effects on reasoning score performance are small in

comparison to their effects on pupils’ attainment, such as performance in GCSE public

examinations.

There are three areas that deserve further detailed discussion.

Consistency in reasoning scores over time

The results indicate high consistency over the three years age 10 to 13. The correlation for

mean CAT score (.89) is at the top of the range that is observed for correlations between the

results of individually administered IQ tests (Vernon, 1979). For example, Rees & Palmer

(1970) report a meta-analysis of five longitudinal studies using scores on the Stanford-Binet

test. They report correlations of .81 between scores at age 6 and 12, and .81 between scores

at age 12 and 17. Hindley & Owen (1978) report a correlation of .69 between Stanford-Binet

score at age 11 and an individually administered AH4 test at age 14, although they consider

this an underestimate of consistency because of the change of test between the two occasions.

When the same pupils had been tested at age 8 and age 11 on the Stanford-Binet the

correlation was .89, and when tested between age 14 and age 17 on the AH4 the correlation

was .87. Moffitt et al. (1993) report a correlation of .84 between WISC-R full-scale IQ at

age 11 and age 13. Given that group administered tests reportedly produce much lower

correlations over time than individually administered tests (Vernon, 1979) this is a very

substantial finding for the CAT.

The test-retest correlation for VR (.87) is notably higher than for either QR (.79) or NVR

(.76). Hopkins & Bracht (1975) report test-retest correlations between the group administered

Lorge-Thorndike VR and NVR tests at age 10 and age 13 of .79 for VR and .61 for NVR.


Thorndike et al. (1986) report that analysis from an LEA wide study suggested “two to three

year correlations of the order of .80, .75 and .65 for VR, QR and NVR respectively” (p94).

The current author has also undertaken an analysis of the technical data reported in the

teacher manuals for the National Foundation for Educational Research (NFER) VR and NVR

test series (Hagues, Smith & Courtneay, 1993; Smith & Hagues, 1993). The two series

consist of separate tests for the 8-9, 10-11 and 12-14 age groups. The average 12 month test-

retest correlation for the three VR tests is .85 against .76 for the three NVR tests.

It is possible that the lower correlation for NVR reflects the comparative novelty of the NVR

material which is not part of the formal curriculum in the UK. This relative novelty can lead

to greater practice effects. For example, the analysis of the 12 month test-retest data for the

NFER VR and NVR series revealed an average increase of around one standard score point

for the VR tests compared with an average increase of over four standard score points for the

NVR tests. In the current study the average changes were smaller but showed the same

pattern of greatest increase for NVR, with mean change scores of –0.6, 0.3 and +1.2 for VR,

QR and NVR respectively. Practice effects will not necessarily lower test-retest correlations.

However if pupils with little experience of NVR item types make particular gains in the

second test, then this will lower the overall correlation, and there does appear to be evidence

of this (see note 4). Greater practice effects may be one reason underlying the lower correlations for

NVR. However, we may also hypothesise that VR scores are particularly stable because

social and educational pressures emphasise verbal skills in particular, and such skills are

prerequisites for subsequent learning in a cumulative manner. As Anastasi (1976) states:

“Not only does the individual retain prior learning, but much of his prior learning provides

tools for subsequent learning. Hence the more progress he has made in the acquisition of

intellectual skills and knowledge at any one point in time, the better able he is to profit from


subsequent learning experiences.” (p328). In short, nothing succeeds like success. While this

is true of a range of skills, the pre-eminence of verbal skills for academic work may make

verbal reasoning particularly stable.

The correlations reported here are of course specific to the age range tested and it is not

assured that similar correlations would be found at other ages. Available data tends to suggest

that scores would be less stable at earlier ages, and this may be particularly pronounced for

NVR (Jensen, 1980; Vernon, 1979; Hopkins & Bracht, 1975; Sax, 1984). Care needs to be

exercised therefore in generalising from the current results, particularly to younger age

groups.

School effects

Some authors (e.g., Primrose, 2000) have suggested that changes in reasoning test scores

may be used as a direct means of assessing the ‘value-added’ by a school, given that the

provision of quality teaching along with a variety of learning opportunities should enhance

the cognitive skills of pupils. The data presented here suggests there were significant

differences between schools in pupils’ progress in the reasoning tests between age 10 and age

13. These school level differences may be statistically significant, but what is their

educational significance? What does an average difference of 3 standard score points on the

CAT represent in terms of some more widely known educational standard? There is a large

body of work relating CAT scores and GCSE examination results (e.g., Fernandes & Strand,

2000). From this we know, for example, that on average 55% of pupils with a mean CAT

score of 98 achieve 5 or more GCSE A*-C grades, while an average of 67% of pupils with a

mean CAT score of 101 achieve this level. These differences are not large, but neither are

they trivial. A school that can on average add three standard score points to CAT scores


between year 6 and year 9 may also be likely to increase pupils’ GCSE results by a

significant margin.

However, it is clear that school effects on pupils’ reasoning scores are small in relation to

their effects on pupils’ attainment. 3.7% of the variance in mean CAT score was at the school

level compared to around 7% of the variance in GCSE points score, and an even higher

proportion in individual GCSE subjects such as design & technology, history, geography,

English literature, mathematics, French and art. We should not be surprised that the school

exerts a particularly strong effect on outcomes such as public examination results, since these

are tests of what is directly taught in schools. For the same reason it is not surprising that the

secondary school exerts a particularly strong effect on pupils’ progress in those reasoning

skills most closely allied to the mathematics curriculum, with the school level accounting for

three times the proportion of variance in progress in QR compared to VR or NVR. This is

consistent with the general finding that mathematics is a subject most uniquely learned at

school (Brandsma & Knuver, 1989; Strand, 1998) while language skills are more a joint

operation of the school and the home/wider culture, and non-verbal skills seem minimally

influenced by school. These findings emphasise that reasoning tests are particularly well

placed to act as baseline assessments for secondary schools, since they take account of

important variation in school intakes which is largely, although not exclusively, outside of

the school’s control.

Finally, it is notable that schools accounted for only 2% of the variance in NVR score

compared to 3.7% of the variance in VR and QR scores. This finding supports Cattell’s

(1963) theory that NVR tasks are excellent measures of fluid-analytic abilities (gf) and are

less affected by acculturation, including formal schooling, than verbal and quantitative tasks.


VR and QR assess inductive and deductive reasoning, which Cattell would classify as fluid-

analytic abilities, but using acquired verbal and numerical concepts, and may be more

broadly defined as crystallised abilities (gc).

Practical and policy implications

1. Pupils’ CAT2E answer sheets can be submitted for computer scoring or can be hand

marked. The results for all pupils in this study were computer scored. Compared to

human marking this minimises potential errors in the consistent and accurate application

of the marking key for each question, the totalling of raw scores for each test and

battery, the calculation of students’ age on the day of the test and the conversion from

raw scores to standard age scores.

2. Several previous studies have reported the percentage of pupils with change scores

within the range plus or minus 10 standard score points. For comparative purposes, the

relevant figures here are 86%, 81% and 79% for VR, QR and NVR respectively.

3. The residual is the difference between the predicted age 13 score based on age 10 score,

and the actual age 13 score. Residuals will be positive where the actual score is higher

than predicted and negative otherwise.

4. This interpretation is supported by a significant negative correlation between

‘volatility’, as indicated by absolute change in NVR score irrespective of sign, and initial

test score (r=-.11, p<.0001). Absolute amount of change tended to be greatest for those

pupils with low initial NVR scores. This was not seen for QR, and for VR there was


Reasoning tests provide an excellent baseline assessment on entry to secondary school

because they are strongly related to pupils’ subsequent performance in national tests and

examinations, but relatively uninfluenced by school effects and prior educational

experiences.

The high correlation of reasoning scores with subsequent national tests and examinations

allows the calculation of robust indicators which can support schools in the process of

individual pupil target setting (Smith, Fernandes & Strand, 2001; Fernandes & Strand,

2000). Such indicators need to be interpreted carefully since reasoning ability is only one

of a large number of factors that influence exam performance, including motivation and

effort, opportunity, teaching quality, parental support and many others.

The high test-retest correlations suggest that a programme of annual testing is

unnecessary. However the extent of individual change suggests that it is important to

retest at times when key educational decisions are required. If one of the purposes of the

reasoning tests is to generate indicators to inform target setting, then schools should retest

pupils during year 9 (age 14) to ensure their indicators are up-to-date and relevant.

The relatively low correlation between VR and NVR (.65) indicates the importance of

using both a VR and a NVR test in the assessment of students’ reasoning ability. Verbal

tests, because of the emphasis which is placed on reading ability and familiarity with

language, can unduly influence the performance of pupils with English as an additional

language, pupils with poor primary school experience or those thought to have specific

difficulties with language based work (Elliott, 1990).

The relative stability of reasoning test scores has no direct bearing on the long running

debate over the relative influence of hereditary and environment on intelligence. The fact

that reasoning test scores are highly correlated over time may simply reflect systematic

and stable environmental influences, such as the socio-economic status of home.

actually more volatility among those with high initial scores (r=.12, p<.0001).


Conclusion

It is unfortunate that reasoning tests became so strongly associated with the period of

selective secondary education in the UK as this has left a negative perception of the tests with

many educators and others. However, antipathy to the tests themselves is essentially a bad

case of shooting the messenger. We should remember that the tests were serving a political

purpose and that, to support their use for selection, claims were made for the tests that were

heavily contested (e.g., Kamin, 1974): for example, that reasoning tests measured ‘innate’

ability, that pupils’ performance was not amenable to coaching or instruction, and that the

scores were unchanging, indicating an invariant capacity of the pupil. Modern interpretations

view reasoning tests as reflecting the pupils’ experiences up to the time of testing rather than

providing an indication of fixed potential (Whetton, 1995). The current study serves to

emphasise the degree of individual pupil change that can underlie even very high correlations.

When used appropriately, reasoning tests have a positive and valuable role to play in

education.


REFERENCES

Anastasi, A. (1976). Psychological testing (fourth edition). New York: MacMillan.

Brandsma, H. P., & Knuver, J. (1989). Effects of school and classroom characteristics on pupil

progress in language and arithmetic. International Journal of Educational Research,

13, (7), 777-788.

Cattell, R. B. (1963). Theory of fluid and crystallised intelligence: a critical experiment.

Journal of Educational Psychology, 54, 1-22.

Elliott, C. D. (1990). Nonverbal tests. In: Walberg, H. J. & Haertel, G. D. (Eds).

International Encyclopaedia of Educational Evaluation. Oxford: Pergamon Press.

Fernandes, C. & Strand, S. (2000). Cognitive Abilities Test and KS3/GCSE indicators:

Technical Report. Windsor: nferNelson.

Hague, N., Smith, P. & Courtney (1993). NFER verbal reasoning test series. Windsor:

nferNelson.

Hindley, C. B., & Owen, C. F. (1978). The extent of individual changes in IQ for ages

between 6 months and 17 years, in a British longitudinal sample. Journal of Child

Psychology and Psychiatry, 19, 329-350.

Hopkins, K. D., & Bracht, G. H. (1975). Ten year stability of verbal and nonverbal IQ

scores. American Educational Research Journal, 12, 467-477.

Jensen, A. R. (1980). Bias in mental testing. London: Methuen.

Kamin, L. J. (1974). The science and politics of IQ. Aylesbury: Penguin.

Moffitt, T. E., Caspi, A., Harkness, A. R., & Silva, P. A. (1993). The natural history of

change in intellectual performance: Who changes? How much? Is it meaningful?

Journal of Child Psychology and Psychiatry, 34, (4), 455-506.

Pinneau, S. R. (1961). Changes in Intelligence Quotient. Boston: Houghton Mifflin.


Primrose, A. F. (2000). Verbal reasoning test scores and their stability over time.

Educational Research, 42, (2), 167-174.

Rees, A. H., & Palmer, F. H. (1970). Factors related to change in mental test performance.

Psychological Monographs, 3, 1-57.

Sax, G. (1984). The Lorge-Thorndike Intelligence Tests / Cognitive Abilities Test. In:

Keyser, D. J., & Sweetland, R. C. (Eds) Test Critiques, Volume 1. Kansas City: Test

Corporation of America.

Smith, P. & Hague, N. (1993). NFER non verbal reasoning test series. Windsor: nferNelson.

Smith, P., Fernandes, C., & Strand, S. (2001). Cognitive Abilities Test Third Edition:

Technical Manual. Windsor: nferNelson.

Strand, S. (1998). A value added analysis of the 1996 primary school performance tables.

Educational Research, 40, (2), 123-137.

Strand, S. (2000). Cognitive Abilities Test and Key Stage 2 Indicators: Technical Report.

Windsor: nferNelson.

Thorndike, R. L., Hagen, E., & France, N. (1986). Cognitive Abilities Test second edition:

Administration manual. Windsor: nferNelson.

Thorndike, R. L. (1971). Educational Measurement (second edition). Washington: American

Council on Education, p. 379.

Vernon, P. E. (1961). The structure of human abilities, second edition. London: Methuen.

Vernon, P. E. (1979). Intelligence: Heredity and Environment. San Francisco: Freeman &

Co.

Whetton, C. (1995). Verbal reasoning tests. In: Husen, T., & Postlethwaite, N. (Eds)

International Encyclopaedia of Education. Oxford: Pergamon Press.


TABLE 1: Pearson correlation coefficients between VR, QR, NVR and mean CAT score(a)

                                          Year 9 CAT scores (age 13+)
Year 6 CAT scores (age 10+)    Verbal       Quantitative   Nonverbal    Mean CAT
                               Reasoning    Reasoning      Reasoning    score
Verbal Reasoning               .87          .72            .65          .82
Quantitative Reasoning         .69          .79            .70          .79
Nonverbal Reasoning            .64          .66            .76          .76
Mean CAT score                 .81          .79            .75          .89

Notes: (a) Listwise deletion, n=10,644. All correlations significant at p<.0001. The figures on the diagonal show the test-retest correlations. The cells above the diagonal (upper right) show the correlations between the batteries at age 10, while the cells below the diagonal (lower left) show the correlations between the batteries at age 13.

TABLE 2: Distribution of change scores for each CAT battery and mean CAT score

Score Difference    Verbal       Quantitative   Nonverbal    Mean CAT
                    Reasoning    Reasoning      Reasoning    score
                    %            %              %            %
<-23                0            0              0            0
-22 to -18          1            1              1            0
-17 to -13          3            4              4            1
-12 to -8           11           12             10           7
-7 to -3            24           21             19           24
-2 to +2            30           26             24           37
+3 to +7            19           19             20           23
+8 to +12           8            11             13           7
+13 to +17          3            4              6            1
+18 to +22          1            1              3            0
>23                 0            0              1            0

Note: Figures are rounded to whole numbers and may not sum to exactly 100%.
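The share of pupils whose score changed by eight or more points can be read off Table 2 by summing every bin outside the range -7 to +7. The short Python sketch below does that arithmetic on the percentages transcribed from the table; the variable and function names are illustrative only, and because the table entries are rounded the totals are approximate.

```python
# Percentages transcribed from Table 2, ordered from the "<-23" bin to the ">23" bin.
CHANGE_BINS = [
    "<-23", "-22 to -18", "-17 to -13", "-12 to -8", "-7 to -3",
    "-2 to +2", "+3 to +7", "+8 to +12", "+13 to +17", "+18 to +22", ">23",
]
PERCENT = {
    "VR":   [0, 1, 3, 11, 24, 30, 19, 8, 3, 1, 0],
    "QR":   [0, 1, 4, 12, 21, 26, 19, 11, 4, 1, 0],
    "NVR":  [0, 1, 4, 10, 19, 24, 20, 13, 6, 3, 1],
    "Mean": [0, 0, 1, 7, 24, 37, 23, 7, 1, 0, 0],
}
# Bins covering a change of fewer than eight points in either direction.
SMALL_CHANGE = {"-7 to -3", "-2 to +2", "+3 to +7"}


def share_changing_by_8_or_more(battery):
    """Sum the rounded percentages in every bin outside -7 to +7."""
    return sum(
        pct
        for label, pct in zip(CHANGE_BINS, PERCENT[battery])
        if label not in SMALL_CHANGE
    )


if __name__ == "__main__":
    for battery in PERCENT:
        print(battery, share_changing_by_8_or_more(battery))
```

On these rounded figures the totals come to 27% for VR, 33% for QR and 38% for NVR, in line with the "around one-quarter" and "one-third" reported in the abstract.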


TABLE 3: 90% confidence bands for CAT batteries taking into account regression to the mean.

90% Confidence Bands (CB) for Obtained Scores

                  Verbal Reasoning      Quantitative Reasoning   Non-Verbal Reasoning
Obtained Score    Lower CB   Upper CB   Lower CB   Upper CB      Lower CB   Upper CB
69-75             -3         11         -1         13            0          14
76-82             -4         10         -3         11            -2         12
83-89             -5         9          -4         10            -4         10
90-96             -6         8          -6         8             -5         9
97-103            -7         7          -7         7             -7         7
104-110           -8         6          -8         6             -9         5
111-117           -9         5          -10        4             -10        4
118-124           -10        4          -11        3             -12        2
125-131           -11        3          -13        1             -14        0

Notes: These confidence bands assume a SEM of 4.0, use the test-retest correlations presented in Table 1 and are centred around ‘true’ scores. The formula for estimating true scores was $T = \mu + r_{12}(X - \mu)$, where $T$ = estimated true score, $\mu$ = population mean, $r_{12}$ = test-retest correlation and $X$ = obtained score (see Thorndike, 1971, p. 379). This table is a simplification of more detailed data applied automatically to results processed by the CAT computer scoring service.
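The band calculation described in the notes to Table 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the scoring service's actual procedure: it assumes a population mean of 100, a two-sided 90% normal multiplier of 1.645, the midpoint of each obtained-score range as the obtained score, and rounding of both the regression shift and the half-width to whole points. The function and constant names are hypothetical.

```python
# Assumptions not stated in the source: mean of 100, z = 1.645 for a 90% band,
# the midpoint of each score range as X, and rounding to whole score points.
POPULATION_MEAN = 100.0
SEM = 4.0          # standard error of measurement, from the notes to Table 3
Z_90 = 1.645
RETEST_R = {"VR": 0.87, "QR": 0.79, "NVR": 0.76}   # diagonal of Table 1


def estimated_true_score(obtained, r12, mean=POPULATION_MEAN):
    """Regress the obtained score towards the mean: T = mean + r12 * (X - mean)."""
    return mean + r12 * (obtained - mean)


def confidence_band(obtained, r12, sem=SEM, z=Z_90):
    """Return (lower, upper) limits as offsets from the obtained score."""
    shift = round(estimated_true_score(obtained, r12) - obtained)
    half_width = round(z * sem)
    return shift - half_width, shift + half_width


if __name__ == "__main__":
    for low in range(69, 126, 7):          # score ranges 69-75 up to 125-131
        midpoint = low + 3
        bands = {name: confidence_band(midpoint, r) for name, r in RETEST_R.items()}
        print(f"{low}-{low + 6}", bands)
```

Under these assumptions an obtained VR score of 72 gives an estimated true score of about 75.6, so the band is shifted four points upwards and runs from -3 to +11, matching the first row of the table. The asymmetry grows as scores move further from the mean and as the test-retest correlation falls, which is why the NVR bands are the most skewed.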

TABLE 4: Variance components models for CAT scores at age 13 and GCSE score

Hierarchical level    GCSE total      VR        QR        NVR       Mean CAT
                      points score                                  score
School                6.8%            3.7%      3.8%      2.0%      3.7%
Year                  0.3%            0.2%      0.0%      0.2%      0.0%
Pupil                 93.0%           96.1%     96.2%     97.8%     97.6%

Notes: The MLwiN software was used to model the data. Pupils from Pupil Referral Units and special schools, and pupils with missing scores on any of the CAT batteries at either age, were excluded from the analysis.


TABLE 5: Summary of value-added analyses

Measure                                                           VR      QR      NVR
Variance at school level after adjusting for initial score(1)    1.1%    3.4%    1.1%
(a) average for the five schools with the highest residuals      1.3     2.0     1.3
(b) average for the five schools with the lowest residuals       -1.1    -2.3    -1.6
(a)-(b) (estimate of school effect)                               2.4     4.3     2.9
Number of schools with statistically significant effects(1)      5       14      5

Notes: (1) derived from multi-level models with random intercepts and fixed slopes at school, year and pupil levels.
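The residuals summarised in Table 5 are the differences between pupils' actual age 13 scores and the scores predicted from their age 10 scores, averaged by school. The sketch below shows that idea with a single-level least-squares line in plain Python. It is a deliberate simplification: the analyses in the paper used multi-level models, and the records, school labels and function names here are hypothetical illustrations, not study data.

```python
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    mean_x, mean_y = mean(x), mean(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


def school_mean_residuals(records):
    """records: list of (school, age10_score, age13_score) tuples.

    Residual = actual age 13 score minus the score predicted from age 10,
    so positive values mean more progress than predicted.
    """
    slope, intercept = fit_line([r[1] for r in records], [r[2] for r in records])
    by_school = defaultdict(list)
    for school, age10, age13 in records:
        by_school[school].append(age13 - (intercept + slope * age10))
    return {school: mean(values) for school, values in by_school.items()}


if __name__ == "__main__":
    # Hypothetical records for two schools, for illustration only.
    toy = [("A", 90, 95), ("A", 100, 103), ("B", 90, 89), ("B", 100, 97)]
    print(school_mean_residuals(toy))
```

The gap between the average residual of the highest and lowest schools, row (a)-(b) in the table, then serves as a rough indication of the size of the school effect in standard score points.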


FIGURE CAPTIONS

Figure 1: Predicted retest score from a regression on initial test score for each CAT battery



