IQ Score Interpretation in Atkins MR/ID Death Penalty Cases: The Good, Bad and the Ugly
DESCRIPTION
I presented this at the 2012 Habeas Assistance Training Seminar in Washington, DC, in August 2012. It reviews a number of psychometric issues in Atkins MR/ID death penalty cases, using examples from a recently completed case and other cases as well.
TRANSCRIPT
IQ Score Interpretations in Atkins Cases
Kevin S. McGrew, PhD
Director, Institute for Applied Psychometrics (IAP)
Additional info re: Kevin McGrew and IAP
can be found at the MindHub™ web portal
www.themindhub.com
For additional information and to stay current (ICDP blog)
www.atkinsmrdeathpenaltly.com
IQ Score Interpretations in Atkins Cases
A recently successful Atkins case (the state agreed to LWOP a few weeks prior to the evidentiary hearing) is the basis of this presentation, augmented with information from other cases
The case involved the Flynn effect, but we will not be covering that today
Recommended article (more at ICDP blog)
“Outliers” – why?
State expert built an argument around the WAIS-R scores being the best estimates of the defendant’s true intelligence (the underlying “You can’t fake bad” strategy) and dismissed the other scores as most likely due to malingering. These arguments are not based on sound and reliable methods of science
State expert failed in professional due diligence to consider scientifically based explanations of the consistencies and inconsistencies in the complete collection of scores
Median of all = 68
It is statistically or mathematically inappropriate to compute the arithmetic average (mean) of IQ scores. The median is acceptable, under certain circumstances
The only way to compute an average (mean) IQ score is to use a complex equation that incorporates the reliabilities of all scores and the intercorrelations among all scores
Median is acceptable metric under certain conditions
Strong convergence of indicators
Fundamental Issue: Comparability (Exchangeability) of IQ Scores
Intellectual Functioning: Conceptual Issues
Kevin S. McGrew and Keith F. Widaman
AAIDD Death Penalty Manual Chapter (in preparation)
Fundamental Issue: Comparability of IQ Scores
“Not all scores obtained on intelligence tests given to the same person will be identical” (AAIDD, 2010, p. 38)
The global (full scale) IQ from different tests are frequently similar…Other times the IQ scores will be markedly different…a finding that often produces consternation for examiners and recipients of psychological reports
Fundamental Issue: Comparability of IQ Scores
Floyd et al. (2008) used generalizability theory methods to evaluate IQ-IQ exchangeability across the global composite (g) scores of ten different IQ batteries (each composite comprising 6 to 14 individual tests) in approximately 1,000 subjects
Fundamental Issue: Comparability of IQ Scores
Average (mdn) r = .76 – let’s round to .80
Coefficient of determination: r² × 100 = 64% shared variance
[Figure: Test A and Test B, r = .80; shared common abilities]
Fundamental Issue: Comparability of IQ Scores
“psychologists can anticipate that 1 in 4 individuals taking an intelligence test battery will receive an IQ more than 10 points higher or lower when taking another battery”
Floyd et al. (2008)
The standard error of the difference (SEdiff) must be used to ascertain if the scores in question are reliably different
SEdiff = 15 × SQRT[2 − r11 − r22], where r11 and r22 are the reliabilities of the two tests
Test A reliability = .95
Test B reliability = .93
1 SEdiff (68% confidence) = 5.2 points
2 SEdiff (95% confidence) = 10.4 points
Before interpreting the scores from these two IQ tests as being significantly different, an IQ-IQ difference of at least 10+ points would be required
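The SEdiff arithmetic above can be checked with a few lines of Python. This is only a minimal sketch of the formula on the slide; the function name `se_diff` is made up for illustration, and the scale SD of 15 is taken from the formula itself.

```python
import math


def se_diff(rel_a: float, rel_b: float, sd: float = 15.0) -> float:
    """Standard error of the difference between two scores on the same scale.

    rel_a and rel_b are the reliabilities (r11, r22) of the two tests.
    """
    return sd * math.sqrt(2.0 - rel_a - rel_b)


# Reliabilities from the slide: Test A = .95, Test B = .93
one_se = se_diff(0.95, 0.93)
print(f"1 SEdiff (68% confidence) = {one_se:.1f} points")
print(f"2 SEdiff (95% confidence) = {2 * one_se:.1f} points")
```

With the slide's reliabilities this gives about 5.2 and 10.4 points, which is where the "10+ points" threshold comes from.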
Easier way via use of confidence band rule-of-thumb
e.g., None of these 6 tests are significantly different from one another
e.g., Not significantly different from each other
If the 95% SEM confidence bands for the compared scores do not touch, the difference is likely a reliable difference and hypotheses about the difference should be entertained
If the 95% SEM confidence bands for the compared scores overlap, the difference is likely not a reliable difference and should not generate significant hypotheses about score differences.
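The confidence band rule-of-thumb can be sketched in Python as follows. This assumes the standard formula SEM = SD × SQRT[1 − reliability] and a band of ± 2 SEM for roughly 95% confidence (matching the "2 SE = 95%" convention used above). The scores and reliabilities in the example are hypothetical, not the scores from the case.

```python
import math


def sem(reliability: float, sd: float = 15.0) -> float:
    """Standard error of measurement (assumed formula: SD * sqrt(1 - reliability))."""
    return sd * math.sqrt(1.0 - reliability)


def band_95(score: float, reliability: float, sd: float = 15.0) -> tuple:
    """Approximate 95% band: score plus or minus 2 SEM."""
    half_width = 2.0 * sem(reliability, sd)
    return (score - half_width, score + half_width)


def bands_overlap(band_a: tuple, band_b: tuple) -> bool:
    """True if the two (low, high) bands touch or overlap."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


# Hypothetical scores, each with an assumed reliability of .95
high = band_95(85, 0.95)
low = band_95(68, 0.95)
mid = band_95(72, 0.95)

print("85 vs 68 overlap:", bands_overlap(high, low))  # bands do not touch
print("72 vs 68 overlap:", bands_overlap(mid, low))   # bands overlap
```

Bands that do not touch flag a difference worth explaining; overlapping bands do not.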
The standard error of the difference (SEdiff) confidence band rule-of-thumb
e.g., WAIS-R score differences represent reliable differences with all other obtained IQ scores
That the WAIS-R scores are higher is a scientifically based fact in this case. One needs to accept it and explain why.
IQ-IQ score differences: Scientific hypotheses that warrant exploration
• Test administration or scoring errors
• Practice effects
• Malingering / effort
• Norm obsolescence (Flynn effect)
• Content differences between different tests or different revisions of the same test
• Little known psychometric problems with some of the “gold standards”
• Individual/situational factors for person or specific test session
Today will focus only on select topics – only those relevant to this example case and some of the more unknown or misunderstood issues
Unscientific IQ-IQ score difference hypotheses I have seen or read
Voodoo psychometrics
Outliers – why?
Most likely scientific explanations in this case
• Ability content differences between different tests or different revisions of the same test
• “Drilling down” further – changes in g-loadings/saturation of subtests included on WAIS-R and WAIS-III/IV
[Figure: Intelligence test battery, individual test g (general intelligence) loadings, derived from factor analysis. Tests T1 to T10 are ordered from high g to low g.]
Think of a general intelligence pole that is saturated with more g-ness (like magnetism) at the top and less g-ness at the bottom.
Factor analysis orders the tests on the pole based on their saturation of g-ness
[Figure: IQ test battery subtest g-loadings or saturation; subtests arranged under general intelligence (g)]
WISC/WISC-R/WAIS/WAIS-R MR/ID subtest g-loading pattern research
Also astounding is the study-by-study consistency in the subtests that emerge as “easy” (Picture Completion, Object Assembly, Block Design) or “hard” (Arithmetic, Vocabulary, Information) for diverse samples of retarded populations
(Kaufman, 1979, p.203)
(28 studies)
[Figure: Plot of ________ 1988 and 1993 WAIS-R subtest scaled scores by g (general intelligence) loadings. x-axis: WAIS-R subtest g (general intelligence) loadings (Kaufman, 1990, p. 253), 0.55 to 0.95, running from low g (less cognitively abstract/complex; fair or moderate g) to high g (more cognitively abstract/complex; good or high g). y-axis: WAIS-R subtest scaled scores, 4 to 16 (low to high). Series: 1988 WAIS-R and 1993 WAIS-R. Subtests plotted: DigSpn, DigSym, Voc, Info, Sim, Cmp, Arith, BlkD, PicC, PicA, ObAsm.]
[Figure: Plot of _________ WAIS-R subtest scaled scores by g (general intelligence) loadings. x-axis: WAIS-R subtest g loadings (Kaufman, 1990, p. 253), 0.55 to 0.95, low g to high g. y-axis: WAIS-R subtest scaled scores, 4 to 16. Series: 1988 WAIS-R and 1993 WAIS-R.]
Rank-order correlation of ___ 1993 WAIS-R subtest scores with test g-loadings is -.71.
Rank-order correlation of ___ 1988 WAIS-R subtest scores with test g-loadings is -.68.
This is a form of internal convergent validity evidence for MR/ID Dx
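A rank-order (Spearman) correlation like the ones reported above can be computed as a Pearson correlation on ranks. The sketch below is pure Python; the g-loadings and scaled scores in the example are hypothetical stand-ins, since the actual values from the plot are not recoverable from this transcript.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical subtest g-loadings and scaled scores (NOT the case data)
g_loadings = [0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
scaled_scores = [9, 8, 8, 6, 5, 4]
print(f"rank-order correlation = {spearman(g_loadings, scaled_scores):.2f}")
```

A strongly negative value means the person scores lowest on the most g-saturated subtests, which is the pattern the slide treats as convergent evidence.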
[Figure: Plot of _________ WAIS-R subtest scaled scores by g (general intelligence) loadings, with annotations on individual subtests. x-axis: WAIS-R subtest g loadings (Kaufman, 1990, p. 253), 0.55 to 0.95, low g to high g. y-axis: __ WAIS-R subtest scaled scores, 4 to 16. Series: 1988 WAIS-R and 1993 WAIS-R.]
• Eliminated from FS IQ in WAIS-III revision (supplemental subtest) & dropped from battery in WAIS-IV revision
• Eliminated from FS IQ in WAIS-IV revision (supplemental subtest)
• Dropped from battery in WAIS-IV revision
The WAIS-III/IV batteries include more complex tests (than the WAIS-R) and are better indicators of general intelligence
The state expert would not recognize (continued to ignore) this scientific fact and held on to the WAIS-R scores as the most accurate, attributing the remaining lower scores to malingering
Outliers – why?
Most likely scientific explanations in this case
• Ability content differences between different tests or different revisions of the same test
• Little known psychometric problems with some of the “gold standards”
CHC IQ Test Batteries DNA Fingerprints
The publisher, in both the WAIS-III/WAIS-IV manuals, describes changes in abilities measured to improve the battery to be consistent with contemporary research
The state expert would not recognize (continued to ignore) this scientific fact and held on to the WAIS-R scores as the most accurate, attributing the remaining lower scores to malingering
Recommended article re: CHC theory of intelligence
(Many more at ICDP blog)
[Figure: Continuum of Progress: Intelligence Theories and the Evolution of the Wechsler Adult IQ Battery.
Continuum, from g to broad abilities: General Ability (g); Dichotomous Abilities; Multiple Cognitive Abilities (incomplete; not implicitly or explicitly CHC-organized); Multiple Cognitive Abilities (incomplete; implicitly or explicitly CHC-organized); Multiple Cognitive Abilities (“complete”; implicitly or explicitly CHC-organized).
Theories: Spearman; Original Gf-Gc; Thurstone PMAs; Cattell-Horn-Carroll (CHC) Theory of Cognitive Abilities.
Batteries: W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008).]
The WAIS-III and WAIS-IV revisions made the battery more consistent with contemporary neurocognitive and intelligence research. They are more valid indicators of general intelligence (supported by WAIS-III/IV tech manuals and independent reviews) than the older WAIS-R.
The changes in abilities measured from the WAIS-R to the WAIS-III/IV help explain the WAIS-R “outlier” scores
The WAIS-IV should not be considered “the gold standard” as per the consensus CHC model of intelligence.
CHC is now considered to be the consensus model of the structure of intelligence
[Figure: Continuum of Progress: Intelligence Theories and the Wechsler Adult IQ Battery. Theory continuum from Spearman's general ability (g) through Original Gf-Gc and Thurstone PMAs to Cattell-Horn-Carroll (CHC) broad abilities. Batteries shown: W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008); WJ (1977); WJ-R (1989); WJ III (2001); WJ III NU (2005); Stanford-Binet LM (1937; 1960; 1972); SB-IV (1986); SB-V (2003).]
The revisions made to other IQ batteries with adult norms (SB and WJ) also changed the composition of their composite IQ scores and are a likely source of score differences that must be considered
[Figure: Continuum of Progress: Intelligence Theories and the Wechsler Adult IQ Battery. Theory continuum from Spearman's general ability (g) through Original Gf-Gc and Thurstone PMAs to Cattell-Horn-Carroll (CHC) broad abilities. Batteries shown: W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008); WJ (1977); WJ-R (1989); WJ III (2001); WJ III NU (2005); Stanford-Binet LM (1937; 1960; 1972); SB-IV (1986); SB-V (2003).]
Knowing the ability coverage similarities and differences is important when comparing and understanding possible IQ-IQ differences between the latest versions of these batteries
[Figure: Continuum of Progress: Intelligence Theories and the Wechsler Adult IQ Battery. Theory continuum from Spearman's general ability (g) through Original Gf-Gc and Thurstone PMAs to Cattell-Horn-Carroll (CHC) broad abilities. Batteries shown: W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008); WJ (1977); WJ-R (1989); WJ III (2001); WJ III NU (2005); Stanford-Binet LM (1937; 1960; 1972); SB-IV (1986); SB-V (2003).]
IQ-IQ score difference explanations may require an understanding of ability coverage across and within battery revisions. There are many possible scenarios when there is a history of IQ testing within the same battery system or across battery systems
[Figure: Continuum of Progress: Intelligence Theories and Test Batteries.
Continuum, from g to broad abilities: General Ability (g); Dichotomous Abilities; Multiple Cognitive Abilities (incomplete; not implicitly or explicitly CHC-organized); Multiple Cognitive Abilities (incomplete; implicitly or explicitly CHC-organized); Multiple Cognitive Abilities (“complete”; implicitly or explicitly CHC-organized).
Primary theories (neuropsych. and psychometric): Spearman; Original Gf-Gc; Simultaneous-Successive; Thurstone PMAs; PASS (Planning, Attention, Simultaneous, Successive); Cattell-Horn-Carroll (CHC) Theory of Cognitive Abilities.
Applied IQ batteries: WJ (1977); WJ-R (1989); WJ III (2001); WJ III NU (2005); Stanford-Binet LM (1937; 1960; 1972); SB-IV (1986); SB-V (2003); DAS (1990); DAS-II (2007); CAS (1997); W-B (1939; 1946); WISC-R (1974); WAIS-R (1981); WPPSI-R (1989); WISC-III (1991); WAIS-III (1997); WPPSI-III (2002); WISC-IV (2003); WAIS-IV (2008); K-ABC (1983); KAIT (1993); KABC-II (2004).]
When childhood and adult battery scores are available, the interpretation of IQ-IQ differences due to ability coverage differences becomes even more complex
Knowledge of CHC ability coverage is critical when brief special purpose (e.g., nonverbal) IQ scores are reported
TONI-2/Ravens: 100% Gf
The state expert argued that some of the lower subtest scores (after the WAIS-R’s) were further evidence of malingering
Voodoo psychometrics
State expert argued that variability in Wechsler subtest scores, esp. lower scores post-Atkins, was an obvious sign of malingering … thus supporting the conclusion that the WAIS-R scores were the best estimate of general intelligence
The implied “You can’t fake smart” strategy or interpretation
There is an EXTREME amount of variability in professional expertise in IQ subtest profile interpretation: scientific/psychometric vs. “clinical” lore-based interpretation
Recall the standard error of the difference (SEdiff) must be used to ascertain if the scores in question are reliably different
[Figure: Plot of ___________ WAIS-R & WAIS-III Similarities scores (± 95% SEM), range of 4. x-axis: date. y-axis: scaled score, 0 to 20. Average (median) = 5.0; 95% SEM band (median) = ± 1.7.]
No statistically reliable difference across all scores
[Figure: Plot of ______________ WAIS-R & WAIS-III Comprehension scores (± 95% SEM), range of 4. x-axis: date. y-axis: scaled score, 0 to 20. Average (median) = 5.5; 95% SEM band (median) = ± 2.3.]
No statistically reliable difference across all scores
[Figure: Plot of __________ WAIS-R, WAIS-III & WAIS-IV Digit Span scores (± 95% SEM), range of 7. x-axis: date. y-axis: scaled score, 0 to 20. Average (median) = 5.5; 95% SEM band (median) = ± 1.9.]
As reported in the WAIS-R tech. manual, DS has poor reliability (mdn = .81), the 4th weakest in the battery. Thus some variability is to be expected. Also, the WAIS-IV DS is a three-component test, not a two-component test, so the versions are not measuring exactly the same construct
7 point difference: there is a scientific explanation
[Figure: Plot of ________ WAIS-R, WAIS-III & WAIS-IV Picture Completion scores (± 95% SEM), range of 6. x-axis: date, 1986 to 2010. y-axis: PICC scaled score, 0 to 20. Average (median) = 4.5; 95% SEM band (median) = ± 2.5.]
There is a scientific explanation
On the WAIS-R → WAIS-III revision: “Only 50% of the content of Picture Completion and Picture Arrangement was retained from the WAIS-R, and only 60% of the Object Assembly items were retained. In addition, the correlations between WAIS-R and WAIS-III version of these subtests are relatively low (r’s of .59 - .63)” (Kaufman & Lichtenberger, 2002, p. 91). That is roughly 35 to 40% shared variance.
The state expert proposed an “Expected WAIS-III IQ (based on WAIS-R IQ) – Actual WAIS-III” discrepancy method to support the malingering hypothesis
Voodoo psychometrics
WAIS-R IQ 85 → Expected WAIS-III 81-83 (will use 82 for discussion)
Obtained WAIS-III scores lower than “expected/predicted” = malingering according to state expert
All other lower scores = malingering as per state expert
Major flaws with this method and logic (part of the commonly stated or implied “You can’t fake smart” strategy)
• There is no need to estimate WAIS-III scores as actual WAIS-III scores exist
• No scientific or professional evidence or literature suggesting the use or validity of this method
• The technical manuals do not recommend the use of these tables for this purpose. The tables are presented in the TM to demonstrate concurrent criterion validity; the information clearly was not presented to support this type of use
• If such a procedure were to be used, the study would need to include subjects that had WAIS-III 9+ years later than WAIS-R (not average of 4.7 weeks)
• The tables do not include the standard error of equating (esp. around the cut score of 70) which would be required as per the Joint Test Standards if the table was intended to be used for this purpose
• If intended for this purpose, the publisher would have had to conduct a properly designed equating study (rectangular distribution; minimum n recommended is 400 to 1,500 – not 192.)
• etc., etc., etc.
The only scientifically accepted method for predicting one score from another is to use the correlation and a prediction model
WAIS-R/WAIS-III correlation of .93 would suggest very accurate prediction
…..but all prediction has error that can be quantified as the standard error of estimate (SEest)
Using WAIS-R IQ scores and a standard prediction model based on the WAIS-R/WAIS-III r = .93, the best predicted WAIS-III score given the WAIS-R scores is 81
But there is prediction error
• 1 SEest (68% confidence) = ± 5.5
• 2 SEest (95% confidence) = ± 11.0
Thus, given this person’s WAIS-R score, the only scientifically accepted expected/predicted WAIS-III score is 81 ± 11 points, i.e., a 95% confidence band for the predicted/expected WAIS-III score of 70 to 92
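The prediction-error figures can be reproduced with the standard formula SEest = SD × SQRT[1 − r²]. This is a minimal sketch: the function names are made up for illustration, the SD of 15 is assumed for both tests, and the predicted score of 81 is taken from the slide rather than derived here.

```python
import math


def se_est(r: float, sd: float = 15.0) -> float:
    """Standard error of estimate for predicting one score from another."""
    return sd * math.sqrt(1.0 - r ** 2)


def prediction_band(predicted: float, r: float, n_se: float = 2.0,
                    sd: float = 15.0) -> tuple:
    """(low, high) band of predicted score plus or minus n_se * SEest."""
    half_width = n_se * se_est(r, sd)
    return (predicted - half_width, predicted + half_width)


# Values from the slide: WAIS-R/WAIS-III r = .93, predicted WAIS-III = 81
one_se = se_est(0.93)
low, high = prediction_band(81, 0.93)
print(f"1 SEest = {one_se:.1f} points")
print(f"95% band = {low:.0f} to {high:.0f}")
```

This gives an SEest of about 5.5 and a band of roughly 70 to 92, so even a correlation of .93 leaves a 22-point-wide range of "expected" scores.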
Only appropriate predicted/expected WAIS-III score prediction (95% confidence) is a range from 70 to 92
All actual WAIS-III IQ scores have SEM confidence bands that overlap with the SEest (standard error of estimate, i.e., error of prediction) band based on the WAIS-R score. Thus, none of the 3 WAIS-III scores is reliably different from the predicted score
The state expert characterized defendant’s measured achievement (WJ III) as “quite impressive” given his level of measured intelligence – at levels inconsistent with MR/ID Dx
The IQ = ACH fallacy argument
Voodoo psychometrics
Problems with “impressive” achievement argument
• Defendant’s original WJ III achievement scores were based on the original 2001 norms. Failure to rescore and reinterpret them in light of the WJ III 2007 Normative Update (WJ III NU)
• Selective “cherry picking” of relatively high scores and failure to utilize the most “real world” score metrics to establish functional academic skills
• Ignored cognitive measures on WJ III Ach. Battery consistent with MR/ID
IQ = ACH fallacy
[Figure annotations: “Cog measures” (marked twice); “State expert focused on these scores”; “Test authors & pub rec this as best metric”; “Hardly ‘quite impressive’”]
Recall the standard error of the estimate (SEest) must be used to estimate the amount of error in the IQ → ACH prediction
The Reality of IQ → Achievement Predicted Scores
IQ-ACH correlation in scientific literature (for adults) reported from .50 to .60
Prediction error (SEest) when r = .50 to .60
• 1 SEest (68% confidence) = ± 12/13
• 2 SEest (95% confidence) = ± 24/26
State expert used IQ of 73 within the context of his “impressive” conclusion. Using this score, the scientifically accepted range of expected/predicted achievement scores is approximately 72 to 98 (68% confidence) and 59 to 111 (95% confidence)
The defendant’s WJ III NU ach. standard scores are well within these expected ranges
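The expected achievement ranges can be approximated with a simple regression model. The sketch below assumes IQ and achievement are both on a mean = 100, SD = 15 scale and uses the textbook prediction ACH = 100 + r × (IQ − 100) with SEest = 15 × SQRT[1 − r²]; these are assumptions for illustration, not necessarily the exact model behind the slide's figures.

```python
import math


def predict_ach(iq: float, r: float, mean: float = 100.0) -> float:
    """Predicted achievement score, regressing toward the mean by r."""
    return mean + r * (iq - mean)


def se_est(r: float, sd: float = 15.0) -> float:
    """Standard error of estimate for the prediction."""
    return sd * math.sqrt(1.0 - r ** 2)


iq = 73  # IQ score used by the state expert, per the slide
for r in (0.50, 0.60):
    predicted = predict_ach(iq, r)
    se = se_est(r)
    print(
        f"r = {r:.2f}: predicted ACH = {predicted:.1f}, "
        f"68% band = {predicted - se:.0f} to {predicted + se:.0f}, "
        f"95% band = {predicted - 2 * se:.0f} to {predicted + 2 * se:.0f}"
    )
```

Under these assumptions the SEest is about 13 (r = .50) or 12 (r = .60) points, and the resulting bands land close to the approximate 72 to 98 and 59 to 111 ranges quoted above. Note that the predicted achievement score sits above the IQ of 73 because of regression toward the mean.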
The IQ-Achievement Fallacy: One cannot achieve above one’s IQ score
Thus, for any given IQ score:
• Half of all individuals will obtain achievement scores at or below their IQ score.
• Half of all individuals will obtain achievement scores at or above their IQ score!
The IQ-Achievement Fallacy: One cannot achieve above one’s IQ score
(often used as part of “You can’t fake smart” argument)
IQ-ACH correlations of .50 to .60 indicate that IQ accounts for only approximately 25% to 36% of the variance in ach. test scores.
Other “You can’t fake smart” examples I have seen (not exhaustive list)
The use of the National Adult Reading Test (NART), a commonly used measure to predict “premorbid” intelligence in neuropsych settings, to predict expected IQ scores against which an existing score is compared
The use of neuropsych “demographically adjusted (Heaton)” norms
Other “You can’t fake smart” examples I have seen (not exhaustive list)
Use of group aptitude measures (ASVAB; AFQT) as convergent validity evidence
Gf Gq Gc Glr Ga Gv Gsm Gs Grw Gk
ASVAB 0.15 0 0 0 0 0 0 0.25 0.3 0.3
ASVAB AFQT
0.25 0 0 0 0 0 0 0 0.5 0.25
[Figure: Proportional CHC broad ability coverage of ASVAB and ASVAB-derived AFQT score. y-axis: % CHC broad abilities represented in ASVAB and ASVAB AFQT score, 5% to 95%.]
Note. ASVAB Verbal tests (Verbal Comp or VL as per CHC model/theory) also tap Gc abilities, but require the subject to read the items…thus involving Grw abilities
Figure annotations:
• Major cognitive ability domains sampled across the major individualized IQ batteries (Wechslers, Stanford-Binet, WJ III/BAT III), which are combined to produce the general intelligence (g) full-scale global composite IQ score
• Other human ability domains (acquired acculturated knowledge) included in the ASVAB differential aptitude test battery
Other “You can’t fake smart” examples I have seen (not exhaustive list)
Unknown problems with some of the older “gold standards”: Often due to lack of due diligence and expertise
The 1960 SB was not a renorming (data gathered for item ordering work)
• 1960 SB norms still based on 1932 norming sample
• Any 1960 SB score may suffer from extreme Flynn effect (e.g. if tested in 1972 with 1960 SB, FE of approximately 12 points)
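The "approximately 12 points" figure is consistent with a simple norm-obsolescence calculation. The sketch below assumes the commonly cited Flynn effect rate of roughly 0.3 IQ points per year; that rate is not stated on the slide and is used here only as an illustrative assumption.

```python
def norm_obsolescence_points(norm_year: int, test_year: int,
                             points_per_year: float = 0.3) -> float:
    """Estimated score inflation from aging norms.

    points_per_year = 0.3 is an assumed, commonly cited Flynn effect rate.
    """
    return points_per_year * (test_year - norm_year)


# 1960 SB norms rest on the 1932 norming sample; example test date of 1972
inflation = norm_obsolescence_points(1932, 1972)
print(f"Estimated Flynn effect inflation: about {inflation:.0f} points")
```

Forty years between norming and testing at the assumed rate gives the roughly 12 points mentioned above.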
The 1986 SB-IV had serious psychometric problems (Reynolds, 1987 & others)
• Underrepresentative standardization sample (“far below industry standards”)
• “IQ roulette”
• “I believe the use of the S-B IV IQs to be logically indefensible, and I certainly would not want to defend their accuracy or validity in a court of law” (Reynolds, 1987; p. 141)
Other “You can’t fake smart” examples I have seen (not exhaustive list)
Unknown problems with some of the older “gold standards”
• The WAIS-R norm sample for 16- to 19-year-olds has been demonstrated to be suspect and “soft.”
“Simply put, the WAIS-R norms for 16-19-year-olds are suspect and examiners should interpret [them] with extreme caution. The norms for 16-19-year-olds are ‘soft’ or ‘easy’ because the reference group performed more poorly than 16-to-19-year-olds really perform in the general population. The surprising result is that the IQs of 16- through 19-year-olds tested on the WAIS-R will be spuriously high by 3 to 5 points” (Kaufman, 1990, p. 85, italics added).
IQ Score Interpretations in Atkins Cases
Kevin S. McGrew, PhD
Director, Institute for Applied Psychometrics (IAP)
www.themindhub.com