IQ Score Interpretation in Atkins MR/ID Death Penalty Cases: The Good, Bad and the Ugly

IQ Score Interpretations in Atkins Cases

Kevin S. McGrew, PhD

Director, Institute for Applied Psychometrics (IAP)

Additional info re: Kevin McGrew and IAP can be found at the MindHub™ web portal

www.themindhub.com

For additional information and to stay current (ICDP blog)

www.atkinsmrdeathpenaltly.com



A recently successful Atkins case (state agreed to LWOP a few weeks prior to the evidentiary hearing) is the basis of this presentation, but it will be augmented with information from other cases

The case involved the Flynn effect, but we will not be covering it today

Recommended article (more at ICDP blog)


“Outliers” – why?

State expert built argument around the WAIS-R scores being the best estimates of the defendant’s true intelligence (the underlying “You can’t fake bad” strategy) and dismissed other scores as most likely due to malingering. These arguments are not based on sound and reliable methods of science

State expert failed in professional due diligence to consider scientifically based explanations of the consistencies and inconsistencies in the complete collection of scores

Median of all = 68

It is statistically or mathematically inappropriate to compute the arithmetic average (mean) of IQ scores. The median is acceptable, under certain circumstances

The only way to compute an average (mean) IQ score is to use a complex equation that incorporates the reliabilities of all scores and the intercorrelations among all scores

Median is acceptable metric under certain conditions

Strong convergence of indicators

Fundamental Issue: Comparability (Exchangeability) of IQ Scores

Intellectual Functioning: Conceptual Issues

Kevin S. McGrew and Keith F. Widaman

AAIDD Death Penalty Manual Chapter (in preparation)

Fundamental Issue: Comparability of IQ Scores

“Not all scores obtained on intelligence tests given to the same person will be identical” (AAIDD, 2010, p. 38)

The global (full scale) IQ from different tests are frequently similar…Other times the IQ scores will be markedly different…a finding that often produces consternation for examiners and recipients of psychological reports


Floyd et al. (2008) used generalizability theory methods to evaluate IQ-IQ exchangeability across the global composite (g) scores of ten different IQ batteries (each composed of 6 to 14 individual tests) in approximately 1,000 subjects


Average (mdn) r = .76; let’s round to .80

Coefficient of determination: r² × 100 = 64% shared variance
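The shared-variance arithmetic above can be checked with a minimal sketch. The function name `shared_variance_pct` is a hypothetical helper introduced only for illustration; the two correlations are the ones quoted on the slide (.76 observed median, .80 rounded).

```python
def shared_variance_pct(r: float) -> float:
    """Percent of variance two tests share, given their correlation r.

    This is the coefficient of determination (r squared) expressed as a percent.
    """
    return r ** 2 * 100


# Correlations quoted on the slide: observed median (.76) and rounded value (.80)
for r in (0.76, 0.80):
    print(f"r = {r:.2f} -> {shared_variance_pct(r):.0f}% shared variance")
```

Note that the unrounded median (.76) implies somewhat less shared variance than the rounded .80 used for discussion.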

[Figure: Test A and Test B, r = .80; the overlap is labeled "Shared common abilities"]

“psychologists can anticipate that 1 in 4 individuals taking an intelligence test battery will receive an IQ more than 10 points higher or lower when taking another battery”

Floyd et al. (2008)

The standard error of the difference (SEdiff) must be used to ascertain if the scores in question are reliably different

SEdiff = 15 × SQRT[2 - r11 - r22], where r11 and r22 are the reliabilities of the two tests

Test A reliability = .95
Test B reliability = .93

1 SEdiff (68% confidence) = 5.2 points
2 SEdiff (95% confidence) = 10.4 points

Before interpreting the scores from these two IQ tests as being significantly different, an IQ-IQ difference of at least 10+ points would be required
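As a hedged illustration of the SEdiff calculation, the sketch below reproduces the slide's numbers. The function name `se_diff` and the default standard deviation of 15 (the usual IQ metric) are assumptions made for this example; the reliabilities (.95 and .93) are the ones given above.

```python
import math


def se_diff(rel_a: float, rel_b: float, sd: float = 15.0) -> float:
    """Standard error of the difference between two scores on the same scale.

    rel_a and rel_b are the reliabilities of the two tests; sd is the
    common standard deviation of the score scale (15 for IQ scores).
    """
    return sd * math.sqrt(2 - rel_a - rel_b)


one_se = se_diff(0.95, 0.93)
print(f"1 SEdiff (68% confidence) = {one_se:.1f} points")
print(f"2 SEdiff (95% confidence) = {2 * one_se:.1f} points")
```

Lower reliabilities widen the band, so the difference required before two scores can be called reliably different grows as either test becomes less reliable.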

Easier way via use of confidence band rule-of-thumb

e.g., None of these 6 tests are significantly different from one another

e.g., Not significantly different from each other

If 95% SEM confidence bands for compared scores do not touch, the difference is likely a reliable difference and hypotheses about the difference should be entertained

If 95 % SEM confidence bands for compared scores overlap, then the difference is likely not a reliable difference and should not generate significant hypotheses about score differences.
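The confidence band rule-of-thumb can be sketched as follows. This is only an illustration: the scores and reliabilities are hypothetical, the helper names `sem_band` and `bands_overlap` are invented for the example, and the band is built with the conventional formula SEM = SD × SQRT(1 - reliability) and a 1.96 multiplier for 95% confidence, which the slides do not spell out.

```python
import math


def sem_band(score: float, reliability: float, sd: float = 15.0, z: float = 1.96):
    """Return the (low, high) confidence band around an obtained score.

    Uses the conventional standard error of measurement,
    SEM = sd * sqrt(1 - reliability); z = 1.96 gives a 95% band.
    """
    sem = sd * math.sqrt(1 - reliability)
    return score - z * sem, score + z * sem


def bands_overlap(band_a, band_b) -> bool:
    """True if two (low, high) bands touch or overlap."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


# Hypothetical scores and reliabilities, for illustration only
higher = sem_band(85, 0.95)
lower = sem_band(68, 0.93)
similar = sem_band(72, 0.93)

print(bands_overlap(higher, lower))   # bands do not touch: likely a reliable difference
print(bands_overlap(similar, lower))  # bands overlap: likely not a reliable difference
```

The rule-of-thumb is a quick visual screen; the SEdiff calculation remains the formal check.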

The standard error of the difference (SEdiff) confidence band rule-of-thumb

e.g., WAIS-R score differences represent reliable differences from all other obtained IQ scores

That the WAIS-R scores are higher is a scientifically based fact in this case. One needs to accept this and explain why.

IQ-IQ score differences: Scientific hypotheses that warrant exploration

• Test administration or scoring errors
• Practice effects
• Malingering / effort
• Norm obsolescence (Flynn effect)
• Content differences between different tests or different revisions of the same test
• Little known psychometric problems with some of the “gold standards”
• Individual/situational factors for person or specific test session

Today will focus only on select topics: only those relevant to this example case and some of the more unknown or misunderstood issues

Unscientific IQ-IQ score difference hypotheses I have seen or read

Voodoo psychometrics


Outliers – why?

Most likely scientific explanations in this case

• Ability content differences between different tests or different revisions of the same test

• “Drilling down” further – changes in g-loadings/saturation of subtests included on WAIS-R and WAIS-III/IV

[Figure: Intelligence test battery individual test g (general intelligence) loadings, derived from factor analysis. Tests T1 to T10 arranged between "High g" and "Low g"]

Think of a general intelligence pole that is saturated with more g-ness (like magnetism) at the top and less g-ness at the bottom.

Factor analysis orders the tests on the pole based on their saturation of g-ness

[Figure: IQ test battery subtest g-loadings or saturation. Labels: Subtests; General intelligence (g)]

WISC/WISC-R/WAIS/WAIS-R MR/ID subtest g-loading pattern research

“Also astounding is the study-by-study consistency in the subtests that emerge as ‘easy’ (Picture Completion, Object Assembly, Block Design) or ‘hard’ (Arithmetic, Vocabulary, Information) for diverse samples of retarded populations”

(Kaufman, 1979, p. 203)

(28 studies)

[Figure: Plot of ________ 1988 and 1993 WAIS-R Subtest Scaled Scores by g (general intelligence) loadings. X-axis: WAIS-R subtest g (general intelligence) loadings (Kaufman, 1990, p. 253), 0.55 to 0.95, from low g (less cognitively abstract/complex; fair or moderate g) to high g (more cognitively abstract/complex; good or high g). Y-axis: ________ WAIS-R subtest scaled scores, 4 to 16, from low to high subtest scaled score. Series: 1988 WAIS-R and 1993 WAIS-R. Subtests plotted: DigSpn, Dig Sym, Voc, Info, Sim, Cmp, Arith, BlkD, PicC, PicA, ObAsm]

[Figure: Plot of _________WAIS-R Subtest Scaled Scores by g (general intelligence) loadings. X-axis: g loadings, 0.55 to 0.95. Y-axis: ________ WAIS-R subtest scaled scores, 4 to 16. Subtests plotted: DigSpn, Dig Sym, Voc, Info, Sim, Cmp, Arith, BlkD, PicC, PicA, ObAsm]

Rank-order correlation of ___ 1993 WAIS-R subtest scores with test g-loadings is -.71.

Rank-order correlation of ___ 1988 WAIS-R subtest scores with test g-loadings is -.68.

This is a form of internal convergence validity evidence for MR/ID Dx
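A rank-order (Spearman) correlation of this kind can be computed as in the sketch below. Everything in the example data is hypothetical: the g-loadings and scaled scores are invented placeholders, not the defendant's scores or Kaufman's published loadings, and the helper names are introduced only for illustration. The approach shown (average ranks for ties, then a Pearson correlation on the ranks) is the standard definition of Spearman's coefficient.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical subtest g-loadings and scaled scores, for illustration only
g_loadings = [0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
scaled_scores = [9, 8, 8, 6, 7, 5]

rho = spearman(g_loadings, scaled_scores)
print(f"Rank-order correlation = {rho:.2f}")
```

A negative coefficient means that scores tend to be lower on the subtests with higher g-loadings, which is the pattern described on the slide.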

[Figure: Plot of _________WAIS-R Subtest Scaled Scores by g (general intelligence) loadings. X-axis: g loadings, 0.55 to 0.95. Y-axis: ________ WAIS-R subtest scaled scores, 4 to 16. Subtests plotted: DigSpn, Dig Sym, Voc, Info, Sim, Cmp, Arith, BlkD, PicC, PicA, ObAsm]

Figure annotations:

• Eliminated from FS IQ in WAIS-III revision (supplemental subtest) & dropped from battery in WAIS-IV revision
• Eliminated from FS IQ in WAIS-IV revision (supplemental subtest)
• Dropped from battery in WAIS-IV revision

The WAIS-III/IV batteries include more complex tests (than the WAIS-R) and are better indicators of general intelligence

The state expert would not recognize (continued to ignore) this scientific fact and held on to the WAIS-R scores as the most accurate, attributing the remaining lower scores to malingering

Outliers – why?

Most likely scientific explanations in this case

• Ability content differences between different tests or different revisions of the same test

• Little known psychometric problems with some of the “gold standards”

CHC IQ Test Batteries DNA Fingerprints

The publisher, in both the WAIS-III/WAIS-IV manuals, describes changes in abilities measured to improve the battery to be consistent with contemporary research

The state expert would not recognize (continued to ignore) this scientific fact and held on to the WAIS-R scores as the most accurate, attributing the remaining lower scores to malingering

Recommended article re: CHC theory of intelligence

(Many more at ICDP blog)

[Figure: Continuum of Progress: Intelligence Theories and the Evolution of the Wechsler Adult IQ Battery]

Continuum (g to Broad Abilities):
• General Ability (g)
• Dichotomous Abilities
• Multiple Cognitive Abilities (incomplete; not implicitly or explicitly CHC-organized)
• Multiple Cognitive Abilities (incomplete; implicitly or explicitly CHC-organized)
• Multiple Cognitive Abilities (“complete”; implicitly …)

Theories: Spearman; Original Gf-Gc; Thurstone PMAs; Cattell-Horn-Carroll (CHC) Theory of Cognitive Abilities

Batteries: W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008)

The WAIS-III and WAIS-IV revisions made the battery more consistent with contemporary neurocognitive and intelligence research. They are more valid indicators of general intelligence (supported by WAIS-III/IV tech manuals and independent reviews) than the older WAIS-R.

The changes in abilities measured from the WAIS-R to the WAIS-III/IV help explain the WAIS-R “outlier” scores

The WAIS-IV should not be considered “the gold standard” as per the consensus CHC model of intelligence.

CHC is now considered to be the consensus model of the structure of intelligence

[Figure: Continuum of Progress: Intelligence Theories and the Wechsler Adult IQ Battery. Continuum from General Ability (g) to Broad Abilities. Batteries shown: W-B (1939; 1946), WAIS-R (1981), WAIS-III (1997), WAIS-IV (2008); WJ (1977), WJ-R (1989), WJ III (2001), WJ III NU (2005); Stanford-Binet LM (1937; 1960; 1972), SB-IV (1986), SB-V (2003)]

The revisions made to other IQ batteries with adult norms (SB and WJ) also changed the composition of their composite IQ scores and are a likely source of score differences that must be considered

[Figure: Continuum from General Ability (g) to Broad Abilities. Batteries shown: W-B (1939; 1946), WAIS-R (1981), WAIS-III (1997), WAIS-IV (2008); WJ (1977); Stanford-Binet LM (1937; 1960; 1972), SB-IV (1986), SB-V (2003)]

Knowing the ability coverage similarities and differences is important when comparing and understanding possible IQ-IQ differences between the latest versions of these batteries

[Figure: Continuum from General Ability (g) to Broad Abilities. Batteries shown: W-B (1939; 1946), WAIS-R (1981), WAIS-III (1997), WAIS-IV (2008); WJ (1977); Stanford-Binet LM (1937; 1960; 1972), SB-IV (1986), SB-V (2003)]

Explaining IQ-IQ score differences may require knowledge of how ability coverage changed across and within battery revisions. There are many possible scenarios when there is a history of IQ testing within the same battery system or across battery systems

[Figure: Continuum of Progress: Intelligence Theories and Test Batteries. Continuum from General Ability (g) to Broad Abilities]

Primary Theories (Neuropsych.; Psychometric): Spearman; Original Gf-Gc; Simultaneous-Successive; Thurstone PMAs; PASS (Planning, Attention, Simultaneous, Successive); Cattell-Horn-Carroll (CHC) Theory of Cognitive Abilities

Applied IQ Batteries:
• WJ (1977); WJ-R (1989); WJ III (2001); WJ III NU (2005)
• Stanford-Binet LM (1937; 1960; 1972); SB-IV (1986); SB-V (2003)
• DAS (1990); DAS-II (2007); CAS (1997)
• W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008)
• WISC-R (1974); WISC-III (1991); WISC-IV (2003)
• WPPSI-R (1989); WPPSI-III (2002)
• K-ABC (1983); KAIT (1993); KABC-II (2004)

When childhood and adult battery scores are available, the interpretation of IQ-IQ differences due to ability coverage differences becomes even more complex

Knowledge of CHC ability coverage is critical when brief special-purpose (e.g., nonverbal) IQ scores are reported

TONI-2/Raven’s: 100% Gf

The state expert argued that some of the lower subtest scores (after the WAIS-R’s) were further evidence of malingering

State expert argued that variability in Wechsler subtest scores, esp. lower scores post-Atkins, was an obvious sign of malingering … thus supporting the conclusion that the WAIS-R scores were the best estimate of general intelligence

The implied “You can’t fake smart” strategy or interpretation

There is an EXTREME amount of variability in the professional expertise in IQ subtest profile interpretation: scientific/psychometric vs. “clinical” lore-based interpretation

Recall the standard error of the difference (SEdiff) must be used to ascertain if the scores in question are reliably different

[Figure: Plot of ___________WAIS-R & WAIS-III Similarities scores (± 95% SEM), range of 4. X-axis: date. Y-axis: scaled score, 0 to 20. Average (median) = 5.0; 95% SEM band (median) = ± 1.7]

No statistically reliable difference across all scores

[Figure: Plot of ______________WAIS-R & WAIS-III Comprehension scores (± 95% SEM), range of 4. X-axis: date. Y-axis: scaled score, 0 to 20]

No statistically reliable difference across all scores

[Figure: Plot of __________ WAIS-R, WAIS-III & WAIS-IV Digit Span scores (± 95% SEM), range of 7. X-axis: date. Y-axis: scaled score, 0 to 20]

As reported in the WAIS-R technical manual, Digit Span (DS) has poor reliability (mdn = .81), the 4th weakest in the battery. Thus some variability is to be expected. In addition, the WAIS-IV DS is a three-component test, not a two-component test, so the versions are not measuring the exact same construct

7-point difference: there is a scientific explanation

[Figure: Plot of ________WAIS-R, WAIS-III & WAIS-IV Picture Completion scores (± 95% SEM), range of 6. X-axis: date, 1986 to 2010. Y-axis: PICC, 0 to 20. 95% SEM band (median) = ± 2.5]

There is a scientific explanation

On the WAIS-R to WAIS-III revision: “Only 50% of the content of Picture Completion and Picture Arrangement was retained from the WAIS-R, and only 60% of the Object Assembly items were retained. In addition, the correlations between WAIS-R and WAIS-III version of these subtests are relatively low (r’s of .59 - .63)” (Kaufman & Lichtenberger, 2002, p. 91)

Correlations of .59 to .63 correspond to only 35-40% shared variance

The state expert proposed an “expected WAIS-III IQ (based on WAIS-R IQ) minus actual WAIS-III IQ” discrepancy method to support the malingering hypothesis

WAIS-R IQ of 85: expected WAIS-III of 81-83 (will use 82 for discussion)

Obtained WAIS-III scores lower than “expected/predicted” = malingering, according to state expert

All other lower scores = malingering as per state expert

Major flaws with this method and logic (part of the commonly stated or implied “You can’t fake smart” strategy)

• There is no need to estimate WAIS-III scores as actual WAIS-III scores exist

• No scientific or professional evidence or literature suggesting the use or validity of this method

• The technical manuals do not recommend the use of these tables for this purpose. The purpose for presenting in TM is to demonstrate concurrent criterion validity. This information clearly was not presented in the TM to support this type of use

• If such a procedure were to be used, the study would need to include subjects that had WAIS-III 9+ years later than WAIS-R (not average of 4.7 weeks)

• The tables do not include the standard error of equating (esp. around the cut score of 70) which would be required as per the Joint Test Standards if the table was intended to be used for this purpose

• If intended for this purpose, the publisher would have had to conduct a properly designed equating study (rectangular distribution; minimum n recommended is 400 to 1,500 – not 192.)

• etc., etc., etc.

The only scientifically accepted method for predicting one score from another is to use the correlation and a prediction model

WAIS-R/WAIS-III correlation of .93 would suggest very accurate prediction

…..but all prediction has error that can be quantified as the standard error of estimate (SEest)

Using the WAIS-R IQ scores and a standard prediction model based on the WAIS-R/WAIS-III r = .93, the best predicted WAIS-III score given the WAIS-R scores is 81

But there is prediction error

• 1 SEest (68% confidence) = ± 5.5
• 2 SEest (95% confidence) = ± 11.0

Thus, given this person’s WAIS-R score, the only scientifically accepted expected/predicted WAIS-III score is 81 ± 11 points: a 95% confidence band for the predicted/expected WAIS-III score of 70 to 92
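The prediction-error figures can be reproduced with the short sketch below. It assumes the conventional formula SEest = SD × SQRT(1 - r²) with SD = 15; the function names are hypothetical helpers. The predicted score of 81 is taken directly from the slide rather than recomputed, because the underlying prediction model is not shown.

```python
import math


def se_est(r: float, sd: float = 15.0) -> float:
    """Standard error of estimate when predicting one score from another.

    r is the correlation between the two tests; sd is the standard
    deviation of the predicted score's scale (15 for IQ scores).
    """
    return sd * math.sqrt(1 - r ** 2)


def prediction_band(predicted: float, r: float, n_se: float = 2.0, sd: float = 15.0):
    """Return the (low, high) band of predicted score +/- n_se * SEest."""
    half_width = n_se * se_est(r, sd)
    return predicted - half_width, predicted + half_width


# r = .93 and the predicted WAIS-III score of 81 come from the slide
print(f"1 SEest = {se_est(0.93):.1f}")
low, high = prediction_band(81, 0.93)
print(f"95% band = {low:.0f} to {high:.0f}")
```

Even with a correlation as high as .93, the 95% band spans more than 20 points, which is why a single "expected" score cannot serve as a cut point.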

The only appropriate predicted/expected WAIS-III score (95% confidence) is a range from 70 to 92

All actual WAIS-III IQ scores have SEM confidence bands that overlap with the SEest (standard error of estimate, the error of prediction) band based on the WAIS-R score. Thus, all 3 WAIS-III scores are not reliably statistically different from the predicted score

The state expert characterized defendant’s measured achievement (WJ III) as “quite impressive” given his level of measured intelligence, at levels inconsistent with MR/ID Dx

The IQ = ACH fallacy argument


Problems with “impressive” achievement argument

Defendant’s original WJ III achievement scores were based on original 2001 norms. Failed to rescore and reinterpret in light of WJ III 2007 Normative Update (WJ III NU)

Selective “cherry picking” of relatively high scores and failure to utilize most “real world” score metrics to establish functional academic skills

• Ignored cognitive measures on WJ III Ach. Battery consistent with MR/ID

IQ = ACH fallacy

[Figure annotations on the WJ III score display]

• Cog measures
• State expert focused on these scores
• Test authors & publisher recommend this as the best metric
• Hardly “quite impressive”

Recall that the standard error of estimate (SEest) must be used to estimate the amount of error in the IQ-to-achievement (ACH) prediction

The Reality of IQ-to-Achievement Predicted Scores

The IQ-ACH correlation reported in the scientific literature (for adults) ranges from .50 to .60

Prediction error (SEest) when r = .50 to .60

• 1 SEest (68% confidence) = ± 12/13
• 2 SEest (95% confidence) = ± 24/26

State expert used IQ of 73 within the context of his “impressive” conclusion. Using this score, the scientifically accepted range of expected/predicted achievement scores is approximately 72 to 98 (68% confidence) and 59 to 111 (95% confidence)

The defendant’s WJ III NU ach. standard scores are well within these expected ranges
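The expected achievement ranges can be approximated with the sketch below. It assumes a simple regression-to-the-mean prediction (predicted = 100 + r × (IQ - 100)) with both scores on a mean-100, SD-15 scale, and SEest = SD × SQRT(1 - r²); those formulas and the function names are assumptions for illustration, since the slides give only the resulting ranges. The IQ of 73 and the correlations of .50 and .60 come from the slide.

```python
import math


def predict_ach(iq: float, r: float, mean: float = 100.0) -> float:
    """Predicted achievement score, regressed toward the mean by correlation r."""
    return mean + r * (iq - mean)


def se_est(r: float, sd: float = 15.0) -> float:
    """Standard error of estimate for a prediction based on correlation r."""
    return sd * math.sqrt(1 - r ** 2)


iq = 73  # IQ score the state expert used
for r in (0.50, 0.60):
    predicted = predict_ach(iq, r)
    se = se_est(r)
    print(
        f"r = {r:.2f}: predicted = {predicted:.1f}, "
        f"68% range = {predicted - se:.0f} to {predicted + se:.0f}, "
        f"95% range = {predicted - 2 * se:.0f} to {predicted + 2 * se:.0f}"
    )
```

Under these assumptions the predicted achievement score sits in the mid-80s, and the ranges land close to the approximate 72 to 98 and 59 to 111 quoted above.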

The IQ-Achievement Fallacy: One cannot achieve above one’s IQ score

Thus, for any given IQ score:

• Half of all individuals will obtain achievement scores at or below their IQ score.

• Half of all students will obtain achievement scores at or above their IQ score!

The IQ-Achievement Fallacy: One cannot achieve above one’s IQ score

(often used as part of the “You can’t fake smart” argument)

IQ-ACH correlations of .50 to .60 indicate that IQ accounts for only approximately 25% to 36% of the variance in achievement test scores.

Other “You can’t fake smart” examples I have seen (not exhaustive list)

The use of the National Adult Reading Test (NART), a commonly used measure to predict “premorbid” intelligence in neuropsych settings, to predict expected IQ scores against which an existing score is compared

The use of neuropsych “demographically adjusted (Heaton)” norms


Use of group aptitude measures (ASVAB; AFQT) as convergent validity evidence

[Figure: Proportional CHC broad ability coverage of ASVAB and ASVAB-derived AFQT score. Y-axis: % CHC broad abilities represented in ASVAB and ASVAB AFQT score, 5% to 95%]

             Gf    Gq   Gc   Glr  Ga   Gv   Gsm  Gs    Grw   Gk
ASVAB        0.15  0    0    0    0    0    0    0.25  0.3   0.3
ASVAB AFQT   0.25  0    0    0    0    0    0    0     0.5   0.25

Figure annotations:

• Major cognitive ability domains sampled across the major individualized IQ batteries (Wechslers, Stanford-Binet, WJ III/BAT III), which are combined to produce the general intelligence (g) full-scale global composite IQ score
• Other human ability domains (acquired acculturated knowledge) included in the ASVAB differential aptitude test battery

Note. ASVAB Verbal tests (Verbal Comp or VL as per CHC model/theory) also tap Gc abilities, but require the subject to read the items, thus involving Grw abilities


Unknown problems with some of the older “gold standards”: Often due to lack of due diligence and expertise

The 1960 SB was not a renorming (data gathered for item ordering work)

• 1960 SB norms still based on 1932 norming sample

• Any 1960 SB score may suffer from extreme Flynn effect (e.g. if tested in 1972 with 1960 SB, FE of approximately 12 points)

The 1986 SB-IV had serious psychometric problems (Reynolds, 1987 & others)

• Underrepresentative standardization sample (“far below industry standards”)

• “IQ roulette”

• “I believe the use of the S-B IV IQs to be logically indefensible, and I certainly would not want to defend their accuracy or validity in a court of law” (Reynolds, 1987; p. 141)
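The Flynn-effect estimate given above for the 1960 SB (approximately 12 points when tested in 1972 against 1932 norms) can be reproduced with a one-line calculation. The rate of 0.3 IQ points per year is an assumption: it is the commonly cited rule of thumb and is the rate implied by the slide's own figures (12 points over 40 years), but it is not stated in the presentation. The function name is hypothetical.

```python
def norm_obsolescence_points(test_year: int, norm_year: int,
                             points_per_year: float = 0.3) -> float:
    """Approximate score inflation from outdated norms (Flynn effect).

    points_per_year = 0.3 is an assumed rule-of-thumb rate, not a value
    taken from the presentation.
    """
    return (test_year - norm_year) * points_per_year


# 1960 SB administered in 1972, with norms still based on the 1932 sample
print(round(norm_obsolescence_points(1972, 1932), 1))
```

The key input is the year the norming sample was collected, not the publication year of the test form.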


Unknown problems with some of the older “gold standards”

• The WAIS-R norm sample for 16- to 19-year-olds has been demonstrated to be suspect and “soft.”

“Simply put, the WAIS-R norms for 16-19-year-olds are suspect and examiners should interpret [them] with extreme caution. The norms for 16-19-year-olds are ‘soft’ or ‘easy’ because the reference group performed more poorly than 16-to-19-year-olds really perform in the general population. The surprising result is that the IQs of 16- through 19-year-olds tested on the WAIS-R will be spuriously high by 3 to 5 points” (p. 85, italics added).

Kaufman (1990)


Kevin S. McGrew, PhD

Director, Institute for Applied Psychometrics (IAP)

www.themindhub.com
