A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

David Shin
Pearson Educational Measurement
May 2007

Using assessment and research to promote learning

rr0701




Pearson Educational Measurement (PEM) is the most comprehensive provider

of educational assessment products, services and solutions. As a pioneer in

educational measurement, PEM has been a trusted partner in district, state

and national assessments for more than 50 years. PEM helps educators and parents

use assessment and research to promote learning and academic achievement.

PEM Research Reports provide dissemination of PEM research and

assessment-related articles prior to publication. PEM reports in .pdf format may be

downloaded at:

http://www.pearsonedmeasurement.com/research/research.htm


Because the world is complex and resources are often limited, test scores often serve to

both rank individuals and provide diagnostic feedback (Wainer, Vevea, Camacho, Reeve,

Rosa, Nelson, Swygert, and Thissen, 2000). These two purposes create challenges from the standpoint of content coverage.

ranking very well. To serve the purpose of diagnosis, standardized achievement tests

must have clusters of items that yield interpretable scores. These clusters could be

learning objectives, subtests, or learning standards. Scores on these clusters of items will

be referred to as objective scores in this paper.

If there are a large number of items measuring an objective on a test, the

estimation of an objective score might be precise and reliable. However, in many cases

the number of items is fewer than optimal for the level of reliability desired. This

condition is problematic, but it often exists in practice (Pommerich, Nicewander, and

Hanson, 1999). The purpose of this paper is to review and evaluate a number of methods

that attempt to provide a more precise and reliable estimation of objective scores. The

first section of this paper reviews a number of different methods for estimating objective

scores. The second section of this paper presents a study evaluating a subset of the more

practical of these methods.

Review of Current Objective Score Estimation Methods

A number of studies have been devoted to proposing methods for the more

precise and reliable estimation of objective scores (Yen, 1987; Yen, Sykes, Ito, and Julian,

1997; Bock, Thissen, and Zimowski, 1997; Pommerich et al., 1999; Wainer et al., 2000;

Gessaroli, 2004; Kahraman and Kamata, 2004; Tate, 2004; Shin, Ansley, Tsai, and Mao,


2005). This section provides a review of these methods. All of these methods, implicitly

or explicitly, estimate the objective scores using collateral test information. This review

notes how different methods take advantage of collateral information in different ways.

The IRT domain score was used as the estimate of the objective score by Bock, Thissen, and Zimowski (1997). The collateral test information is used in the sense that, when the item parameters are calibrated, the data from items in the other objectives contribute to the estimation of the item parameters in the objective of interest.

For example, if items 1 to 6 are measuring the first objective and items 7 to 12 are

measuring the second objective in the same test, in calibrating the test, the data from

items 7 to 12 actually contribute to the estimation of the item parameters of items 1 to 6.

Bock compared the proportion-correct score with the IRT domain score computed from theta estimated by either maximum likelihood estimation (MLE) or Bayesian estimation, and found that the objective score estimated from the IRT approach was more accurate than the proportion-correct score.

Several studies extended the work of Bock, Thissen, and Zimowski (1997).

Pommerich et al. (1999) similarly used IRT domain score to estimate the objective score.

The only difference is that they did not estimate the score at an individual level but rather

at a group level. Tate (2004) extended Bock's study to the multidimensional case with different dimensionalities and degrees of correlation between subsets of the test. The main

purpose of Tate’s study was to find a better method between MLE and the expected a

posteriori (EAP) estimation to estimate the objective score. He found that the choice of

estimation approach depends on the intended uses of the objective scores (Tate, 2004, p. 107).


Adopting a somewhat different approach, Kahraman and Kamata (2004) also estimated the objective score using IRT models. They used out-of-scale items, which are explicitly a kind of collateral test information, to help estimate the objective score. Out-of-scale items are items that measure different objectives but appear in the same test as the objective of interest. For example, if the score of Objective One is being estimated, then the items in Objectives Two, Three, and so on are out-of-scale items. Kahraman and Kamata manipulated the number of out-of-scale items, the correlation between objectives, and the discrimination of the items, and found that the correlation between objectives needs to be at least 0.5 for moderate-discrimination out-of-scale items and 0.9 for high-discrimination items in order for the out-of-scale items to be of benefit.

Wainer et al. (2000) used an empirical Bayesian (EB) method to compute the

objective score. The basic concept of this method is similar to Kelley’s (1927) regressed

score. Indeed, it is a multivariate version of Kelley’s method. In Kelley’s method, only

one test score was used to estimate the regressed score, but in Wainer’s method, several

other objective scores were used to estimate the objective score of interest.

Yen (1987) and Yen et al. (1997) combined the IRT and EB methods to compute the objective score, using a method labeled the Objective Performance Index (OPI). They used the IRT domain score as the prior and assumed that the prior had a beta distribution. With the additional assumption that the likelihood of the objective score follows a binomial distribution, the posterior distribution is also a beta distribution. The mean and standard deviation (SD) of the posterior distribution are the estimated objective score and its standard error, respectively.


Gessaroli (2004) used multidimensional IRT (MIRT) to compute the objective score and compared it to Wainer's EB method (Wainer et al., 2000). He found that the EB method produced almost the same results as the MIRT method.

Shin et al. (2005) applied the Markov chain Monte Carlo (MCMC) technique to estimate the objective scores of the IRT, OPI, and Wainer et al. methods, and then compared these MCMC versions to their original non-MCMC counterparts. They found that the MCMC alternatives performed the same as or slightly better than the non-MCMC methods.

Some of the methods reviewed in this paper may not be practical for use in a large scale testing program. For example, the method of Pommerich et al. (1999) only estimates the objective score for groups, so it may not meet the needs of a large scale test that reports individual objective scores. The method of Gessaroli (2004) involves MIRT, which is rarely used in large scale tests. The method of Kahraman and Kamata (2004) requires certain conditions in order to be advantageous, and these conditions may not be met in many large scale tests (p. 417). The study of Tate (2004) was very similar to Bock's study (Bock, Thissen, and Zimowski, 1997). The MCMC IRT and MCMC OPI methods in Shin et al. (2005) are too time-consuming and hence may not be practical.

However, other methods reviewed in this paper may be suitable for use in a large scale testing program. The method of Yen et al. (1997) is currently used for some state tests. The Bock, Thissen, and Zimowski (1997) method is convenient to implement in tests that use IRT. The Wainer et al. (2000) method and the MCMC Wainer method of Shin et al. (2005) performed better than the other methods in the study of Shin et al. (2005). Therefore, these methods were included in the study reported in the next section.

Evaluation of Selected Objective Score Estimation Methods

The study reported in this section compares five methods that use collateral test information to estimate the objective score for a mixed-format test. These methods include an adjusted version of Bock et al.'s item response theory (IRT) approach (Bock et al., 1997), Yen's objective performance index (OPI) approach (Yen et al., 1997), Wainer et al.'s regressed score approach (Wainer et al., 2000), Shin et al.'s MCMC regressed score approach (Shin et al., 2005), and the proportion-correct score approach. They are referred to as the Bock method, the Yen method, the Wainer method, the Shin method, and the proportion-correct method hereafter. In addition to comparing these five methods using a common data set, the present study extends earlier work by including mixed-format tests. Only Yen et al. (1997) considered the case of a mixed-format test. As more large scale tests require the reporting of objective scores and mixed-format tests become more common, it is necessary to conduct a study that compares different objective score estimation methods for mixed-format tests.

From previous studies (Yen, 1987; Yen et al., 1997; Wainer et al., 2000; Pommerich et al., 1999; Bock et al., 1997; Shin et al., 2005), the number of items in an objective and the correlation between objectives were the two main factors that affected the estimation results. In the current study, the proportion of polytomous items in a test and the student sample size were also studied. The performance of the methods under different combinations of these four factors was compared. Six main questions were investigated in this study:

(1) What is the order of the objective score reliabilities estimated from the different methods, and how are they influenced by the four factors studied?

(2) How accurate is the nominal 95% confidence/credibility interval (95CI) of each method, and how is it influenced by the four factors studied?

(3) What is the order of the widths of the 95CI for each method, and how are they influenced by the four factors studied?

(4) What are the magnitude and order of the absolute bias (Bias) of the different methods, and how are they influenced by the four factors studied?

(5) What are the magnitude and order of the standard deviation of estimation (SD) of the different methods, and how are they influenced by the four factors studied?

(6) What are the magnitude and order of the root-mean-square error (RMSE) of the different methods, and how are they influenced by the four factors studied?

The first question addresses the reliability of the objective score estimated by each method. As mentioned previously, one of the reasons to use methods other than the proportion-correct method is that the proportion-correct objective score is not reliable given the limited number of items in each objective. Therefore, the objective score estimated by a selected method must be more reliable than the proportion-correct score. However, because some of the estimation methods, such as the IRT and OPI methods, do not provide a way to estimate reliability, it is empirically impossible to compare the reliability of each method. In the present study, since the true score and the estimated score of each method are both available from the simulation process, it becomes possible to compute the correlation between the true score and the estimated score, and the reliability of each method can be obtained through the following equation:

\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2\,\sigma_T^2}{\sigma_T^2\,\sigma_X^2} = \frac{\sigma_{TX}^2}{\sigma_T^2\,\sigma_X^2} = \left(\frac{\sigma_{TX}}{\sigma_T\,\sigma_X}\right)^2 = \rho_{TX}^2,   (1)

where \rho_{XX'}, \sigma_X, \sigma_T, \sigma_{TX}, and \rho_{TX} are the reliability, the standard deviation of the estimated score, the standard deviation of the true score, the covariance of the true and estimated scores, and the correlation between the true and estimated scores, respectively. (The middle step uses the classical test theory identity \sigma_{TX} = \sigma_T^2, which holds because errors are uncorrelated with true scores.)
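Under the simulation design this criterion can be computed directly from Equation (1). A minimal sketch in Python, assuming one array of true scores and one array of estimated scores per condition:

```python
import numpy as np

def reliability(true_scores, estimated_scores):
    # Equation (1): reliability is the squared correlation between
    # the true and estimated objective scores across examinees.
    rho_tx = np.corrcoef(true_scores, estimated_scores)[0, 1]
    return rho_tx ** 2
```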

The second question concerns the accuracy of the nominal 95% confidence/credibility intervals (95CI) of each method. For the objective score estimated from the proportion-correct method, the 95CI is a confidence interval; for the other methods, the 95CI is a credibility interval. Very often when an objective score is reported, the 95CI is also presented. However, the nominal 95CI may not really cover the true score 95% of the time. Through the simulation process in the current study, the lower and upper bounds of each estimated objective score were computed, and the actual percentage of the time that the 95CI covers the true score, the percent coverage (PC), was then computed for each method to answer this question.
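A minimal sketch of the PC computation, assuming the per-examinee interval bounds and the true scores are stored in arrays:

```python
import numpy as np

def percent_coverage(lower, upper, true_scores):
    # Share of examinees whose true objective score falls inside the
    # nominal 95% confidence/credibility interval, as a percentage.
    covered = (true_scores >= lower) & (true_scores <= upper)
    return 100.0 * np.mean(covered)
```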

The third question is about the widths of the 95CI for each method. It is possible for a method to have a 95CI that covers the true score 95% of the time but is very wide. For example, a 95CI of the proportion-correct method may range from 4% to 95%. Obviously this range will cover the true proportion-correct score well, because the whole proportion-correct scale runs only from 0% to 100%. However, such a 95CI is practically meaningless because its range is simply too wide. Therefore, in this study the widths of the 95CI for each method were compared. It is necessary to consider both the coverage of the 95CI and its width at the same time in order to find a better estimating method.

The fourth to sixth questions are about the Bias, SD, and RMSE. They are defined

as follows:

Bias = \left|E(W) - \tau\right|,   (2)

SD = \sqrt{E\left[W - E(W)\right]^2},   (3)

RMSE = \sqrt{E\left[(W - \tau)^2\right]} = \sqrt{Bias^2 + SD^2},   (4)

where W is the point estimator and \tau is the true parameter.
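Over the replications of one condition, these three criteria can be computed as in the following sketch, where estimates is a hypothetical array of the point estimates W for one examinee-objective and true_value is the corresponding true parameter:

```python
import numpy as np

def bias_sd_rmse(estimates, true_value):
    w = np.asarray(estimates, dtype=float)
    bias = abs(w.mean() - true_value)             # Equation (2)
    sd = w.std()                                  # Equation (3)
    rmse = np.sqrt(np.mean((w - true_value)**2))  # Equation (4)
    # Up to floating-point error, rmse**2 == bias**2 + sd**2.
    return bias, sd, rmse
```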

To summarize, six criteria were used as the dependent variables in the comparisons of the different methods: (1) reliability, (2) the percent coverage of the true score for a nominal 95% confidence/credibility interval (PC), (3) the width of the 95% confidence/credibility interval (95CI), (4) Bias, (5) SD, and (6) RMSE. A more desirable estimating method should have high score reliability, a narrow 95CI, accurate percent coverage of the nominal 95CI, small SD, Bias close to zero, and, consequently, small RMSE.

Procedures

Simulated Responses

For the purpose of this study, data were simulated to represent different conditions found in test data. Four factors were considered in generating the simulated data; detailed information about these factors is provided in Table 1. In all, 3 × 3 × 3 × 3 = 81 conditions were considered in the simulation study.


Table 1. Simulation Factors and Number of Levels

Factors                          No. of levels   Description
Number of examinees              3               250, 500, 1000
Test length                      3               6, 12, or 18 items for each objective
Correlation between objectives   3               Approximately 1.0, 0.8, and 0.5
Ratio of CR/MC items             3               0, 20%, or 50% (in number of items)

The present study used empirical item parameters from a state testing program. Within this item pool there are 30 MC items and 18 CR items (each with three score categories: 0, 1, and 2). The ability (θ) values for the examinees on the different objectives were simulated from a standardized multivariate normal distribution with correlation coefficients equal to 1.0, 0.8, or 0.5.

For each of the 81 conditions, 100 simulation trials (i.e., response vectors) were used. At each condition, with the item parameters assumed to be known, the simulation process involved the following steps:

1. Generate θ's for each of the examinees from a standardized multivariate normal distribution. The generated values were restricted to be between -3 and 3. The correlation coefficients between the θ's were .5, .8, or 1.0.

2. Compute P_ij(θ) using the IRT 3PL equation for item i in objective j, and P_ijk(θ) using the generalized partial credit model equation for category k of item i in objective j. The item parameters were randomly selected with replacement from the item pool. P_ij(θ) and P_ijk(θ) are defined in the "Bock method" section of this paper.

3. Use the P_ij(θ) and P_ijk(θ) from step 2 to compute the true score for each objective. For example, if items 1 through 6 were in objective 1, the true score of objective 1 was the total of the P_ij(θ) and weighted P_ijk(θ) of items 1 through 6. This true objective score was used as the baseline against which the different methods of estimating the objective score were compared.

4. Generate responses y_ij using the P_ij(θ) and P_ijk(θ) from step 2. Each y_ij is either a sample from the binomial distribution with probability P_ij(θ) or a sample from the multinomial distribution with probabilities equal to P_ijk(θ), where k indexes the category level and ranges from 0 to 2.

5. Use the data from step 4 to estimate the objective scores with the different estimating methods (a code sketch of steps 1 through 4 follows this list). The details of each method are described later.

6. Repeat steps 4 and 5 100 times and compute the objective score reliability, 95CI width, percent coverage of the nominal 95CI, Bias, SD, and RMSE for further analyses.
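The sketch below illustrates steps 1 through 4 for a single examinee on one six-item objective (five MC items and one three-category CR item). The item parameter values here are illustrative placeholders, not the empirical state-test pool used in the study, and the restriction of θ to [-3, 3] is implemented by simple clipping:

```python
import numpy as np

rng = np.random.default_rng(2007)

def p_3pl(theta, a, b, c):
    # 3PL probability of a correct response to an MC item.
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def p_gpcm(theta, a, b_steps):
    # Generalized partial credit model category probabilities for one
    # CR item; b_steps holds the step parameters for categories 1..m-1.
    cum = np.cumsum(np.concatenate(([0.0], a * (theta - b_steps))))
    num = np.exp(cum)
    return num / num.sum()

# Step 1: correlated abilities for two objectives, clipped to [-3, 3].
corr = 0.8
thetas = rng.multivariate_normal([0.0, 0.0], [[1.0, corr], [corr, 1.0]], size=500)
thetas = np.clip(thetas, -3.0, 3.0)

# Step 2: item probabilities at one examinee's theta (placeholder parameters).
theta = thetas[0, 0]
a_mc, b_mc, c_mc = np.full(5, 1.0), np.linspace(-1.0, 1.0, 5), np.full(5, 0.2)
p_mc = p_3pl(theta, a_mc, b_mc, c_mc)
p_cr = p_gpcm(theta, a=1.0, b_steps=np.array([-0.5, 0.5]))  # scores 0, 1, 2

# Step 3: true objective score = expected points / maximum points (5 + 2).
true_score = (p_mc.sum() + np.dot([0, 1, 2], p_cr)) / 7.0

# Step 4: simulated item responses.
y_mc = rng.binomial(1, p_mc)
y_cr = rng.choice([0, 1, 2], p=p_cr)
```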

It should be noted that in the simulation process the data were simulated according to the correlation between the θ's rather than the correlation between the objective scores. However, because the objective scores are a monotonically increasing function of θ, the correlation between the objective scores is approximately equal to the correlation between the θ's.

Objective Scores

The simulated data were used as input to the different estimating methods to compute the objective scores. These estimated objective scores were then used to compare the estimating methods. Brief descriptions of the estimating methods follow.

Page 13: A Comparison of Methods of Estimating Subscale Scores for …images.pearsonassessments.com/images/tmrs/tmrs_rg/... · 2013. 4. 25. · Bock compared the proportion-correct score with

12

Bock method

The data from the simulation procedure were used to estimate the examinees' θ values for the whole test. The estimated abilities, \hat\theta, were then entered into the 3PL equation,

P_{ij}(\hat\theta) = c_{ij} + (1 - c_{ij})\,\frac{\exp[1.7\,a_{ij}(\hat\theta - b_{ij})]}{1 + \exp[1.7\,a_{ij}(\hat\theta - b_{ij})]},

or the generalized partial credit model equation,

P_{ijk}(\hat\theta) = \frac{\exp\left[\sum_{c=0}^{k} a_{ij}(\hat\theta - b_{ijc})\right]}{\sum_{h=0}^{m_i-1}\exp\left[\sum_{c=0}^{h} a_{ij}(\hat\theta - b_{ijc})\right]},

to estimate P_{ij}(\hat\theta) and P_{ijk}(\hat\theta), where i, j, c, k, and m_i represent the item, the objective, the score-level index, the current score level, and the total number of score levels for item i, respectively. The objective score for objective j, IRT T_j, was then computed by the equation

IRT\ T_j = \frac{1}{n_j}\sum_{i=1}^{I_j}\pi_{ij}(\hat\theta),   (5)

where i indexes items, I_j is the number of items in objective j, and n_j is the maximum possible points in objective j. Note that n_j = \sum_{i=1}^{I_j}(m_i - 1).

For MC items,

\pi_{ij}(\hat\theta) = P_{ij}(\hat\theta);   (6)

for CR items,

\pi_{ij}(\hat\theta) = \sum_{k=1}^{m_i}(k-1)\,P_{ijk}(\hat\theta).   (7)

Bayes estimates of the scale scores, \hat\theta, were used to compute the IRT objective scores, with a normal (0, 1) prior distribution for the abilities. Usually, when the objective scores are estimated, the item parameters (i.e., the item pool) already exist; therefore, in this study the item parameters were assumed known.

The variance of the IRT T can be expressed in the equation below:


VAR_{IRT\,T_j} = \frac{\sum_{i=1}^{n_{MC}} P_{ij}(\hat\theta)\left[1-P_{ij}(\hat\theta)\right] + \sum_{i=1}^{n_{CR}}\left\{\sum_{k=1}^{m_i}(k-1)^2 P_{ijk}(\hat\theta) - \left[\sum_{k=1}^{m_i}(k-1)P_{ijk}(\hat\theta)\right]^2\right\}}{n_j^2},   (8)

where n_{MC} and n_{CR} are the number of MC and CR items, respectively, and n_j represents, as defined previously, the maximum possible points in objective j.
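A minimal sketch of Equations (5) through (8), assuming the 3PL and GPCM probabilities at \hat\theta have already been computed (p_mc is an array of MC correct-probabilities; p_cr_list is a list of category-probability arrays, with categories scored 0 to m_i - 1):

```python
import numpy as np

def bock_irt_t(p_mc, p_cr_list):
    scores = [np.arange(len(p)) for p in p_cr_list]       # 0, 1, ..., m_i - 1
    exp_cr = [s @ p for s, p in zip(scores, p_cr_list)]   # expected CR item scores
    n_j = len(p_mc) + sum(len(p) - 1 for p in p_cr_list)  # maximum possible points
    t = (np.sum(p_mc) + sum(exp_cr)) / n_j                # Equation (5)
    var_mc = np.sum(p_mc * (1.0 - p_mc))
    var_cr = sum((s**2) @ p - e**2
                 for s, p, e in zip(scores, p_cr_list, exp_cr))
    return t, (var_mc + var_cr) / n_j**2                  # Equation (8)
```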

Yen method

The following steps were used to estimate Yen's T (Yen et al., 1997):

1. Estimate the IRT item parameters for all selected items.

2. Estimate θ for the whole test (including all objectives).

3. For each objective, calculate IRT T_j (see Equation (5)), denoted \hat{T}_j, where j represents objective j.

4. Obtain

Q = \sum_{j=1}^{J}\frac{n_j\left(\frac{x_j}{n_j} - \hat{T}_j\right)^2}{\hat{T}_j\left(1-\hat{T}_j\right)}.   (9)

If Q > \chi^2(J, .10), then Yen T_j = x_j / n_j,

p_j = x_j,   (10)

and

q_j = n_j - x_j,   (11)

where x_j is the observed score obtained in objective j, n_j is the maximum points that can be obtained in objective j, and J is the number of objectives.

If Q \le \chi^2(J, .10), then

p_j = \hat{T}_j\,n_j^{*} + x_j   (12)


and

q_j = (1-\hat{T}_j)\,n_j^{*} + n_j - x_j.   (13)

The Yen T_j, \tilde{T}_j, is defined to be

\tilde{T}_j = \frac{p_j}{p_j + q_j} = \frac{\hat{T}_j\,n_j^{*} + x_j}{n_j^{*} + n_j},

where

n_j^{*} = \frac{\hat{T}_j\,(1-\hat{T}_j)}{\hat\sigma^2(\hat{T}_j)} - 1, \qquad \hat\sigma^2(\hat{T}_j) = \frac{\left[\sum_{i=1}^{I_j}\pi'_{ij}(\hat\theta)\right]^2}{n_j^2\,I(\hat\theta)},   (14)

and I(\hat\theta) = I_{MC}(\hat\theta) + I_{CR}(\hat\theta) is the test information at \hat\theta, with components defined in Equations (15) to (18).

For MC items,

P'_{ij}(\hat\theta) = \frac{1.7\,a_{ij}\left[1-P_{ij}(\hat\theta)\right]\left[P_{ij}(\hat\theta)-c_{ij}\right]}{1-c_{ij}}   (15)

and

I_{MC}(\hat\theta) = \sum_{i=1}^{n_{MC}} \frac{\left[P'_{ij}(\hat\theta)\right]^2}{P_{ij}(\hat\theta)\left[1-P_{ij}(\hat\theta)\right]} = \sum_{i=1}^{n_{MC}} \frac{1.7^2\,a_{ij}^2\left[1-P_{ij}(\hat\theta)\right]\left[P_{ij}(\hat\theta)-c_{ij}\right]^2}{\left[1-c_{ij}\right]^2 P_{ij}(\hat\theta)},   (16)

where n_{MC} is the total number of MC items in the whole test.

For CR items,

\pi'_{ij}(\hat\theta) = a_{ij}\left[\sum_{k=1}^{m_i}(k-1)^2 P_{ijk}(\hat\theta) - \left(\sum_{k=1}^{m_i}(k-1)P_{ijk}(\hat\theta)\right)^2\right]   (17)


and

I_{CR}(\hat\theta) = \sum_{i=1}^{n_{CR}} \frac{\left[\pi'_{ij}(\hat\theta)\right]^2}{\sum_{k=1}^{m_i}(k-1)^2 P_{ijk}(\hat\theta) - \left[\sum_{k=1}^{m_i}(k-1)P_{ijk}(\hat\theta)\right]^2} = \sum_{i=1}^{n_{CR}} a_{ij}^2\left[\sum_{k=1}^{m_i}(k-1)^2 P_{ijk}(\hat\theta) - \left(\sum_{k=1}^{m_i}(k-1)P_{ijk}(\hat\theta)\right)^2\right],   (18)

where n_{CR} is the total number of CR items in the whole test. The definition of \pi_{ij}(\hat\theta) is in Equations (6) and (7).

The variance of Yen T can be expressed in the equation below:

Var_{Yen\,T_j} = \frac{p_j\,q_j}{(p_j+q_j)^2\,(p_j+q_j+1)},   (19)

where p_j and q_j are defined in Equations (10) to (13).
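The following sketch strings the Yen steps together for one examinee, assuming the IRT \hat{T}_j values and their variances from Equation (14) are supplied per objective; scipy is used only for the chi-square critical value and a beta-quantile credibility interval, consistent with the beta posterior (the fallback branch assumes 0 < x_j < n_j):

```python
import numpy as np
from scipy.stats import beta, chi2

def yen_t(t_hat, var_t_hat, x, n):
    # t_hat, var_t_hat, x, n: per-objective arrays of IRT T-hat, its
    # estimated variance, observed points, and maximum points.
    t_hat, var_t_hat = np.asarray(t_hat), np.asarray(var_t_hat)
    x, n = np.asarray(x, dtype=float), np.asarray(n, dtype=float)
    J = len(t_hat)
    Q = np.sum(n * (x / n - t_hat)**2 / (t_hat * (1 - t_hat)))  # Equation (9)
    if Q > chi2.ppf(0.90, df=J):       # model fit rejected at the .10 level
        p, q = x, n - x                # Equations (10)-(11)
    else:
        n_star = t_hat * (1 - t_hat) / var_t_hat - 1            # Equation (14)
        p = t_hat * n_star + x                                  # Equation (12)
        q = (1 - t_hat) * n_star + n - x                        # Equation (13)
    yen = p / (p + q)                                  # posterior (beta) mean
    var = p * q / ((p + q)**2 * (p + q + 1))           # Equation (19)
    ci = (beta.ppf(0.025, p, q), beta.ppf(0.975, p, q))  # 95% credibility interval
    return yen, var, ci
```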

Wainer method

In vector notation, for the multivariate situation involving several objective scores collected in the vector x,

REG\ T = \bar{x} + B\,(x - \bar{x}),   (20)

where \bar{x} is the mean vector containing the mean of each objective involved, and B is a matrix that is the multivariate analog of the estimated reliability of each objective. All that is needed to calculate REG T is an estimate of B. The equation for calculating B is

B = S_{true}\,(S_{obs})^{-1},   (21)


where S_obs is the observed variance-covariance matrix of the objective scores and S_true is the variance-covariance matrix of the true objective scores.

Because errors are uncorrelated with true scores, it is easy to see that \sigma_{\tau_v\tau_{v'}} = \sigma_{x_v x_{v'}}, where \sigma_{\tau_v\tau_{v'}} and \sigma_{x_v x_{v'}} are the off-diagonal elements of \Sigma_{true} and \Sigma_{obs}, the population variance-covariance matrices of the true and observed objective scores. It is in the diagonal elements of S_obs and S_true that the difference arises. However, if the diagonal elements of S_obs are multiplied by the reliability, \sigma_\tau^2/\sigma_x^2, of the subscale in question, the results are the diagonal elements of S_true. It is customary to estimate this reliability with Cronbach's coefficient \alpha (Wainer et al., 2000).

The score variance of the estimates for the vth objective is the vth diagonal element of the matrix

C = S_{true} - S_{true}\,(S_{obs})^{-1}\,S_{true}.   (22)
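A minimal sketch of Equations (20) through (22), taking an examinees-by-objectives score matrix and a vector of per-objective reliability estimates (e.g., Cronbach's alpha) as given:

```python
import numpy as np

def wainer_reg_t(x, alphas):
    x = np.asarray(x, dtype=float)           # examinees x objectives
    x_bar = x.mean(axis=0)
    s_obs = np.cov(x, rowvar=False)          # observed covariance matrix
    s_true = s_obs.copy()                    # off-diagonals carry over unchanged
    np.fill_diagonal(s_true, np.diag(s_obs) * np.asarray(alphas))
    B = s_true @ np.linalg.inv(s_obs)                    # Equation (21)
    reg_t = x_bar + (x - x_bar) @ B.T                    # Equation (20), row-wise
    c = s_true - s_true @ np.linalg.inv(s_obs) @ s_true  # Equation (22)
    return reg_t, np.diag(c)                 # scores and per-objective variances
```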

Shin method

Instead of using the empirical Bayes approach of the regressed score method, the MCMC regressed score method uses a fully Bayesian model. Here is an example with two objectives to illustrate the fully Bayesian estimation. If there are two objectives, called m and v, the observed scores and true scores of these two objectives are x_m, x_v, \tau_m, and \tau_v, respectively. The fully Bayesian model results in the posterior distributions (\tau_m | x_m, x_v), MREG T_m, and (\tau_v | x_m, x_v), MREG T_v.

From classical test theory (CTT),

x_{pm} = \tau_{pm} + \epsilon_{pm}   (23)

and

x_{pv} = \tau_{pv} + \epsilon_{pv},   (24)


where p represents examinee p.

These can be re-parameterized via CTT equations as

x_{pm} = \tau_{pm} + \epsilon_{pm}, \quad \epsilon_{pm} \sim N(0, \sigma_m^2),   (25)

where

\tau_{pm} = \mu_m + \delta_{pm}, \quad \delta_{pm} \sim N(0, \psi_m^2);   (26)

and

x_{pv} = \tau_{pv} + \epsilon_{pv}, \quad \epsilon_{pv} \sim N(0, \sigma_v^2),   (27)

where

\tau_{pv} = \mu_v + \delta_{pv}, \quad \delta_{pv} \sim N(0, \psi_v^2).   (28)

\epsilon_{pm} and \epsilon_{pv} are the error terms. \mu_m and \mu_v are the "common" true scores for objective m and objective v. \delta_{pm} and \delta_{pv} are the "unique" true scores for examinee p on objective m and objective v.

In the fully Bayesian model, instead of using maximum likelihood estimates (MLE) to replace the unknown parameters, as the empirical Bayesian method does, it is necessary to place prior distributions on the unknown parameters, such as \sigma_m^2, \sigma_v^2, \psi_m^2, \psi_v^2, and \psi_{mv}.

If it is assumed that

\begin{pmatrix} x_{pm}\\ x_{pv}\\ \tau_{pm}\\ \tau_{pv}\end{pmatrix} \sim N\left[\begin{pmatrix}\mu_m\\ \mu_v\\ \mu_m\\ \mu_v\end{pmatrix},\; \begin{pmatrix} \psi_m^2+\sigma_m^2 & \psi_{mv} & \psi_m^2 & \psi_{mv}\\ \psi_{mv} & \psi_v^2+\sigma_v^2 & \psi_{mv} & \psi_v^2\\ \psi_m^2 & \psi_{mv} & \psi_m^2 & \psi_{mv}\\ \psi_{mv} & \psi_v^2 & \psi_{mv} & \psi_v^2 \end{pmatrix}\right],   (29)

then

\tau_{pm}\,|\,x_{pm}, x_{pv} \sim N\left[\mu_m + \Sigma_{21}\Sigma_{11}^{-1}\begin{pmatrix}x_{pm}-\mu_m\\ x_{pv}-\mu_v\end{pmatrix},\; \psi_m^2 - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}'\right],   (30)

and

\tau_{pv}\,|\,x_{pm}, x_{pv} \sim N\left[\mu_v + \Sigma_{31}\Sigma_{11}^{-1}\begin{pmatrix}x_{pm}-\mu_m\\ x_{pv}-\mu_v\end{pmatrix},\; \psi_v^2 - \Sigma_{31}\Sigma_{11}^{-1}\Sigma_{31}'\right],   (31)

where

\Sigma_{11} = \begin{pmatrix}\psi_m^2+\sigma_m^2 & \psi_{mv}\\ \psi_{mv} & \psi_v^2+\sigma_v^2\end{pmatrix},   (32)

\Sigma_{12} = \Sigma_{21}',   (33)

\Sigma_{21} = (\psi_m^2,\; \psi_{mv}),   (34)

and

\Sigma_{31} = (\psi_{mv},\; \psi_v^2).   (35)

The priors of \mu_m, \mu_v, \sigma_m^2, \sigma_v^2, \psi_m^2, \psi_v^2, and \psi_{mv} need to be specified. This model was coded in WinBUGS to estimate the fully Bayesian regressed objective scores. The variance of the estimate is the variance of the posterior distribution, and the 95% credibility interval equals the interval between the .025 and .975 quantiles of the posterior distribution.
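The full model was estimated with WinBUGS in the original study; the sketch below only evaluates the conditional posterior moments of Equations (30) and (31) for fixed parameter values, which is the normal-theory core that the MCMC sampler repeatedly updates:

```python
import numpy as np

def conditional_posterior(x, mu, psi, sigma2):
    # x: observed scores (x_pm, x_pv); mu: common true scores (mu_m, mu_v);
    # psi: 2x2 true-score covariance matrix [[psi_m^2, psi_mv],
    # [psi_mv, psi_v^2]]; sigma2: error variances (sigma_m^2, sigma_v^2).
    psi = np.asarray(psi, dtype=float)
    sigma11 = psi + np.diag(sigma2)            # Equation (32)
    gain = psi @ np.linalg.inv(sigma11)        # rows: Sigma_21 and Sigma_31 times inverse
    resid = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    post_mean = np.asarray(mu) + gain @ resid  # means of Equations (30)-(31)
    post_var = np.diag(psi - gain @ psi.T)     # variances of Equations (30)-(31)
    return post_mean, post_var
```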

Results

The results of the comparisons of the methods are presented in Figures 1 to 24. Figures 1 to 4 compare the estimated reliability under the different factors (the number of examinees, the number of items in each objective, the correlation between objectives, and the ratio of CR/MC items). For brevity, unless otherwise noted, "reliability" in the following text refers to the estimated reliability. Figures 5 to 8 compare the width of the 95CI for each method. Figures 9 to 12 compare the percent coverage of the 95CI for each method. Figures 13 to 16 compare the absolute bias under the different factors. Figures 17 to 20 compare the SD for each method. Figures 21 to 24 compare the RMSE for each method.

Comparison of Reliability

[Figure: estimated reliability (y-axis) by number of examinees (250, 500, 1000); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 1. Reliability for different number of examinees

Figure 1 is the comparison of reliability for different number of examinees. From

Figure 1 it can be seen that:


(1) The number of examinees did not impact the reliability of the objective score: the reliability for different numbers of examinees remained approximately the same, except for the Bock method. Even there, the largest reliability for the Bock method, .74, is only about .03 higher than the lowest reliability, .71, for this method.

(2) The order of the magnitude of the objective score reliability for each method is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.

[Figure: estimated reliability (y-axis) by number of items in each objective (6, 12, 18); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 2. Reliability for different number of items in each objective

Figure 2 is the comparison of reliability for different number of items in each objective.

From Figure 2 it can be seen that:


(1) The number of items in each objective had an impact on the reliability of the objective score: the reliability increased as more items were included in each objective. The increase was steeper from 6 to 12 items and flatter from 12 to 18 items.

(2) The order of the magnitude of the reliability of the objective score for each method is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.

(3) For the Bock method, the reliability was the same for the 12- and 18-item cases.

[Figure: estimated reliability (y-axis) by correlation between objectives (0.5, 0.8, 1.0); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 3. Reliability for different correlation between objectives


Figure 3 is the comparison of reliability for different correlation between objectives.

From Figure 3 it can be seen that:

(1) The correlation between objectives had an impact on the reliability of the objective score: the reliability increased as the correlation became higher. However, this effect did not appear for the proportion-correct method.

(2) The order of the magnitude of the reliability of the objective score for each method is approximately: Shin = Wainer > Yen > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.

(3) The Bock method behaved uniquely in this case. When the correlation between objectives equaled .5 it had the lowest reliability, and when the correlation equaled 1.0 it had the highest reliability. This finding is related to the assumption of the Bock method: because Bock's objective score is actually the IRT domain score, it must satisfy the IRT assumption that all objectives measure the same thing (i.e., the unidimensionality assumption). When the correlation between objectives is 0.5 the assumption is violated, making it the worst method for estimating the objective score; when the correlation is 1.0 the assumption is met and it is the best method.


[Figure: estimated reliability (y-axis) by ratio of CR/MC items in number of items (0, 0.2, 0.5); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 4. Reliability for different ratio of CR/MC items

Figure 4 is the comparison of reliability for different ratio of CR/MC items in each

objective. From Figure 4 it can be seen that:

(1) The ratio of CR/MC items in each objective had an impact on the reliability of the objective score: the reliability increased as the ratio of CR/MC items in each objective increased.

(2) The order of the magnitude of the reliability of the objective score for each method is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.

(3) Because the CR items used in this study have the same number of categories (three), more CR items mean more possible points in each objective. That is, when there are more possible points in each objective, the reliability of the estimated objective scores is higher.

To sum up: as the number of items per objective, the correlation between objectives, and the ratio of CR/MC items increased, the reliability of the estimated objective score also increased. The number of examinees did not affect the reliability of the objective score. Among the five methods studied, the estimated objective scores from the Shin and Wainer methods generally had the highest reliability, except when the correlation between objectives was 1.0; in that case, the Bock method yielded the highest reliability.

Comparison of the Width of 95CI

[Figure: width of the 95CI (y-axis) by number of examinees (250, 500, 1000); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 5. Width of 95CI for different number of examinees


Figure 5 is the comparison of the width of 95CI for different number of examinees.

From Figure 5 it can be seen that:

(1) The number of examinees did not have an impact on the width of the 95CI.

(2) The order of the magnitude of the width of the 95CI for each method is: Proportion-correct > Bock > Shin = Wainer > Yen.

[Figure: width of the 95CI (y-axis) by number of items in each objective (6, 12, 18); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 6. Width of 95CI for different number of items in each objective

Figure 6 is the comparison of the width of 95CI for different number of items per

objective. From Figure 6 it can be seen that:

(1) The number of items per objective had an impact on the width of the 95CI. As the number of items per objective increased, the width of the 95CI decreased.

(2) The order of the magnitude of the width of the 95CI for each method is: Proportion-correct > Bock > Shin = Wainer > Yen.

[Figure: width of the 95CI (y-axis) by correlation between objectives (0.5, 0.8, 1.0); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 7. Width of 95CI for different correlation between objectives

Figure 7 is the comparison of the width of 95CI for different correlation between

objectives. From Figure 7 it can be seen that:

(1) The correlation between objectives did not have an impact on the width of the 95CI for the Bock, proportion-correct, and Yen methods, but had a slight effect on the Wainer and Shin methods: for these two methods, as the correlation between objectives increased, the width of the 95CI slightly decreased.

(2) The order of the magnitude of the width of the 95CI for each method is: Proportion-correct > Bock > Shin = Wainer > Yen.


[Figure: width of the 95CI (y-axis) by ratio of CR/MC items in number of items (0, 0.2, 0.5); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 8. Width of 95CI for different ratio of CR/MC items

Figure 8 is the comparison of the width of 95CI for different ratio of CR/MC items.

From Figure 8 it can be seen that:

(1) The ratio of CR/MC items did not have an impact on the width of the 95CI except for the Bock method; for the Bock method, as the ratio of CR/MC items increased, the width of the 95CI slightly decreased.

(2) The order of the magnitude of the width of the 95CI for each method is: Proportion-correct > Bock > Shin = Wainer > Yen.

To sum up: among the four factors studied, only the number of items per objective had an impact on the width of the 95CI; more items per objective tended to lead to a narrower 95CI. The order of the magnitude of the width of the 95CI is consistent across the different factors, and the Yen method has the narrowest 95CI.

Comparison of the Percent Coverage of 95CI

[Figure: percent coverage of the 95CI (y-axis) by number of examinees (250, 500, 1000); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 9. Percent coverage of 95CI for different number of examinees

Figure 9 is the comparison of the percent coverage of 95CI for different number of

examinees. From Figure 9 it can be seen that:

(1) Generally, the number of examinees did not have an impact on the percent coverage of the 95CI.

(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.


[Figure: percent coverage of the 95CI (y-axis) by number of items in each objective (6, 12, 18); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 10. Percent coverage of 95CI for different number of items in each objective

Figure 10 is the comparison of the percent coverage of 95CI for different number of

items in each objective.

From Figure 10 it can be seen that:

(1) For the Bock and Yen methods, the number of items in each objective had an impact on the percent coverage of the 95CI. Generally, as the number of items per objective increased, the percent coverage of the 95CI decreased.

(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.


[Figure: percent coverage of the 95CI (y-axis) by correlation between objectives (0.5, 0.8, 1.0); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 11. Percent coverage of 95CI for different correlation between objectives

Figure 11 is the comparison of the percent coverage of 95CI for different correlation

between objectives.

From Figure 11 it can be seen that:

(1) For the Bock and Yen methods, the correlation between objectives had an obvious impact on the percent coverage of the 95CI.

(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.


[Figure: percent coverage of the 95CI (y-axis) by ratio of CR/MC items in number of items (0, 0.2, 0.5); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 12. Percent coverage of 95CI for different ratio of CR/MC items

Figure 12 is the comparison of the percent coverage of 95CI for different ratio of

CR/MC items. From Figure 12 it can be seen that:

(1) For the Yen method, the CR/MC ratio had some impact on the percent coverage of the 95CI, but the pattern is not consistent across situations.

(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.

To sum up: the 95CIs of all the methods except the Yen method are conservative; that is, the nominal 95CI covered the true score more than 95% of the time in the simulation study. The Yen method has a 95CI that is closest to its nominal value, 95%. This may be due to the non-symmetric shape of the 95CI for Yen's method.

Comparison of Bias

[Figure: Bias (y-axis) by number of examinees (250, 500, 1000); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 13. Bias for different number of examinees

Figure 13 is the comparison of bias for different number of examinees. From

Figure 13 it can be seen that:

(1) The number of examinees did not impact the Bias of the objective score: the Bias for different numbers of examinees remained approximately the same for each method.

(2) The order of the Bias for each method was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitudes of the Bias ranged from approximately 0.01 to 0.05; that is, on a 100-point scale, the Bias is around 1 to 5 points.

[Figure: Bias (y-axis) by number of items in each objective (6, 12, 18); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 14. Bias for different number of items in each objective

Figure 14 is the comparison of bias for different number of items in each objective. From

Figure 14 it can be seen that:

(1) The number of items in each objective had an impact on the Bias of the objective scores for the Wainer, Shin, and Yen methods: the Bias increased as more items were included in each objective. The increase was steeper from 6 to 12 items and flatter from 12 to 18 items. The number of items did not affect the Bias of the proportion-correct method. For the Bock method, the Bias decreased as the number of items increased from 6 to 12, but increased slightly as the number of items increased from 12 to 18.

(2) The order of the Bias for each method was approximately: Bock > Wainer > Shin > Yen > proportion-correct.

[Figure: Bias (y-axis) by correlation between objectives (0.5, 0.8, 1.0); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 15. Bias for different correlation between objectives

Figure 15 is the comparison of bias for different correlation between objectives. From

Figure 15 it can be seen that:

(1) The correlation between objectives had an impact on the Bias of each method: the Bias increased as the correlation became higher. However, this effect did not appear for the proportion-correct method.

(2) The order of the Bias for each method was approximately: Bock > Wainer > Shin > Yen > proportion-correct.

(3) The Bock method behaved uniquely in this case. When the correlation between objectives equaled .5 it had the highest Bias, but when the correlation equaled 1.0 its Bias was relatively low. This finding is related to the assumption of the Bock method: because Bock's objective score is actually the IRT domain score, it must satisfy the IRT assumption that all objectives measure the same thing (i.e., the unidimensionality assumption). When the correlation between objectives is 0.5 the assumption is violated, making it the worst method for estimating the objective score; when the correlation is 1.0 the assumption is met, which lowered the Bias.

[Figure: Bias (y-axis) by ratio of CR/MC items in number of items (0, 0.2, 0.5); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 16. Bias for different ratio of CR/MC items

Figure 16 is the comparison of bias for different ratios of CR/MC items in each objective. From Figure 16 it can be seen that:

(1) The ratio of CR/MC items in each objective did not impact the Bias of the objective score.

(2) The order of the Bias for each method was approximately: Bock > Wainer > Shin > Yen > proportion-correct.

To sum up: the Bias was affected by two factors, the number of items per objective and the correlation between objectives. As the number of items per objective or the correlation between objectives increased, the Bias of the estimated objective score also increased. The number of examinees and the ratio of CR/MC items did not affect the Bias of the objective score. Among the five methods studied, the order of the magnitude of the Bias is consistent across the four factors; generally, the order was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitude of the Bias ranged approximately from 0.01 to 0.08.

Comparison of the SD


[Figure: SD (y-axis) by number of examinees (250, 500, 1000); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 17. SD for different number of examinees

Figure 17 is the comparison of the SD for different number of examinees. From

Figure 17 it can be seen that:

(1) The number of examinees did not impact the SD.

(2) The order of the SD for each method is: Proportion-correct > Yen > Shin >

Wainer > Bock.


[Figure: SD (y-axis) by number of items in each objective (6, 12, 18); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 18. SD for different number of items in each objective

Figure 18 is the comparison of the SD for different number of items per objective.

From Figure 18 it can be seen that:

(1) The number of items per objective had an impact on the SD. As the number of items per objective increased, the SD decreased.

(2) Generally, the order of the SD for each method is: Proportion-correct > Yen > Shin > Wainer > Bock.


[Figure: SD (y-axis) by correlation between objectives (0.5, 0.8, 1.0); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 19. SD for different correlation between objectives

Figure 19 is the comparison of the SD for different correlation between objectives.

From Figure 19 it can be seen that:

(1) The correlation between objectives had an impact on the SD, except for the proportion-correct method. For the other methods, as the correlation between objectives increased, the SD decreased.

(2) The order of the SD for each method is: Proportion-correct > Yen > Shin > Wainer > Bock.


[Figure: SD (y-axis) by ratio of CR/MC items in number of items (0, 0.2, 0.5); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 20. SD for different ratio of CR/MC items

Figure 20 is the comparison of the SD for different ratio of CR/MC items. From

Figure 20 it can be seen that:

(1) The ratio of CR/MC items did not impact the SD.

(2) The order of the SD for each method is: Proportion-correct > Yen > Shin > Wainer > Bock.

To sum up: among the four factors studied, only the number of items per objective and the correlation between objectives had an impact on the SD; more items per objective or a higher correlation between objectives tended to lead to a smaller SD. The order of the SD is consistent across the different factors: Proportion-correct > Yen > Shin > Wainer > Bock. The magnitude of the SD ranged approximately from 0.07 to 0.16.

Comparison of the RMSE


[Figure: RMSE (y-axis) by number of examinees (250, 500, 1000); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 21. RMSE for different number of examinees

Figure 21 is the comparison of the RMSE for different number of examinees. From

Figure 21 it can be seen that:

(1) Generally, the number of examinees did not impact the RMSE.

(2) The order of the RMSE for each method is: Proportion-correct > Yen > Bock > Shin = Wainer.


[Figure: RMSE (y-axis) by number of items in each objective (6, 12, 18); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 22. RMSE for different number of items in each objective

Figure 22 is the comparison of the RMSE for different number of items in each

objective.

From Figure 22 it can be seen that:

(1) The number of items in each objective had an impact on the RMSE for each method. Generally, as the number of items per objective increased, the RMSE decreased.

(2) The order of the RMSE for each method is: Proportion-correct > Yen > Bock > Shin = Wainer.


[Figure: RMSE (y-axis) by correlation between objectives (0.5, 0.8, 1.0); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 23. RMSE for different correlation between objectives

Figure 23 is the comparison of the RMSE for different correlation between

objectives.

From Figure 23 it can be seen that:

(1) Except for the proportion-correct method, the correlation between objectives had an impact on the RMSE: as the correlation between objectives increased, the RMSE decreased.

(2) The order of the RMSE for each method was generally: Proportion-correct > Yen > Bock > Shin = Wainer.

(3) It should be noted that the Bock method was greatly affected by the correlation between objectives. When the correlation equals 0.5, the assumption of the Bock method is strongly violated and its RMSE became high; when the correlation equals 1.0, the Bock assumption is perfectly met and its RMSE became the lowest.

[Figure: RMSE (y-axis) by ratio of CR/MC items in number of items (0, 0.2, 0.5); one line per method (Bock, Yen, Wainer, Shin, proportion-correct)]

Figure 24. RMSE for different ratio of CR/MC items

Figure 24 is the comparison of the RMSE for different ratio of CR/MC items. From

Figure 24 it can be seen that:

(1) The CR/MC ratio had a slight impact on the RMSE for each method: as the ratio increased, the RMSE slightly decreased.

(2) The order of the RMSE for each method was generally: Proportion-correct > Yen > Bock > Shin = Wainer.

To sum up: among the four factors studied, only the number of items per objective and the correlation between objectives had an impact on the RMSE; more items per objective or a higher correlation between objectives tended to lead to a smaller RMSE. The order of the RMSE is consistent across the different factors and was generally: Proportion-correct > Yen > Bock > Shin = Wainer. The magnitude of the RMSE ranged approximately from 0.08 to 0.16.

Discussion

Based on the results of this simulation study, it appears that using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of the reliability of the objective score. Factors that affected the reliability of the objective scores included the number of items per objective, the correlation between objectives, and the ratio of CR/MC items; the number of examinees did not affect the reliability of the objective score. As the values of these factors increased, the reliability of the estimated objective score also increased. Only the number of items per objective affected the width of the 95CI and the percent coverage of the 95CI. Generally, the objective scores estimated from the Wainer and Shin methods had the highest reliability, ranging approximately from 0.68 to 0.83. The objective scores estimated from the Yen and proportion-correct methods had relatively lower reliability: from 0.59 to 0.75 for the proportion-correct method and from 0.65 to 0.80 for the Yen method. The studied factors seem to have had a larger impact on the Bock method, especially the correlation between objectives; when the correlation was 1.0, the Bock method had the highest reliability (around 0.85).

Table 2 shows the effective gain in the number of items for each method relative to the proportion-correct method, computed using the Spearman-Brown prophecy formula. It can be seen from Table 2 that the use of the Wainer and Shin methods can lead to an effective gain of up to 1.63 times the number of items in a subscore, and the Bock method may yield an effective gain of up to 1.89 times the number of items in an objective. For example, if there are 6 items in an objective, the Wainer and Shin methods may attain the reliability of an objective with 9.78 (6 × 1.63 = 9.78) items scored with the proportion-correct method. In other words, to achieve the score reliability of the Wainer/Shin methods with the proportion-correct method, the number of items per objective would have to be increased by a factor of 1.63.
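For readers who wish to verify the table entries, the gain factor k follows from solving the Spearman-Brown prophecy formula for the lengthening factor. A sketch of the calculation, using the minimum Wainer/Shin reliability (0.68) and the minimum proportion-correct reliability (0.59) reported below:

    \rho_{k} = \frac{k\rho}{1 + (k - 1)\rho}
    \quad\Longrightarrow\quad
    k = \frac{\rho^{*}(1 - \rho)}{\rho(1 - \rho^{*})}
      = \frac{0.68\,(1 - 0.59)}{0.59\,(1 - 0.68)} \approx 1.48,

where ρ is the proportion-correct reliability and ρ* is the reliability of the augmented method. This value matches the minimum Wainer/Shin entry in Table 2 below.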

Table 2. The Effective Gain in Number of Items for Each Method

  Method                      Wainer/Shin   Yen    Bock   Proportion-correct
  Original reliability  min      0.68       0.65   0.58         0.59
                        max      0.83       0.80   0.85         0.75
  Gain in number        min      1.48       1.29   0.96         1.00
  of items              max      1.63       1.33   1.89         1.00

On the percentage-correct scale, the widths of the 95CI are approximately 27%, 35%, 50%, and 55% for the Yen method, the Wainer and Shin methods, the Bock method, and the proportion-correct method, respectively. That means that if an objective score is reported based on the Yen method, 95% of the time the true objective score will fall within an interval 27 percentage points wide. Similarly, 95% of the time a score based on the Wainer and Shin methods will fall within an interval 35 percentage points wide, and so on. The narrower the width, the more precise the estimate. However, width needs to be evaluated together with the percentage coverage of the nominal CI (confidence or credibility interval): generally, a good estimator should have both a narrow 95CI and an accurate percentage coverage of the nominal CI.


By combining the results for the 95CI width and the percentage coverage of the 95CI, it can be seen that the Yen method provided a narrow 95CI and a relatively accurate percentage coverage of the 95CI. The other methods were all too conservative because their 95CIs covered the true value more than 95% of the time. Therefore, an adjusted 95CI should be developed and studied in order to obtain a more precise 95CI for these methods.
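As a concrete illustration of how the percentage coverage of a nominal 95CI can be checked by simulation, the following is a minimal sketch with hypothetical values (it is not the code used in this study):

    import numpy as np

    # Minimal sketch: empirical coverage of a nominal 95% interval in a
    # simulation where the true objective score is known by construction.
    rng = np.random.default_rng(0)
    true_score = 0.70                  # hypothetical true objective score
    n_reps, n_items = 2000, 10

    covered, widths = 0, []
    for _ in range(n_reps):
        x = rng.binomial(n_items, true_score) / n_items  # proportion-correct
        se = np.sqrt(x * (1 - x) / n_items)              # normal-approximation SE
        lo, hi = x - 1.96 * se, x + 1.96 * se
        covered += (lo <= true_score <= hi)
        widths.append(hi - lo)

    print(f"Empirical coverage: {covered / n_reps:.3f}")  # compare to nominal .95
    print(f"Mean 95CI width:    {np.mean(widths):.3f}")

A well-calibrated interval should show empirical coverage near .95; coverage well above .95, as observed for several methods here, indicates an interval that is too conservative.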

In general, using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of SD and RMSE. The proportion-correct method had the smallest Bias but the largest SD; because the magnitude of the Bias was much smaller than that of the SD, the RMSE of the proportion-correct method was the largest. Therefore, the other methods can be expected to estimate objective scores better than the proportion-correct method.
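This conclusion follows from the standard decomposition of mean squared error into a squared bias term and a variance term:

    \mathrm{RMSE} = \sqrt{\mathrm{Bias}^{2} + \mathrm{SD}^{2}},

so even with the smallest Bias, the proportion-correct method's large SD dominates and produces the largest RMSE.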

Although the Bock method had the smallest SD, it also had the largest Bias. The Wainer and Shin methods had SDs similar to those of the Bock method but smaller Bias; therefore, they had slightly smaller RMSEs than the Bock method.

Factors that affected the RMSE, SD, and Bias of objective scores included the number of items per objective and the correlation between objectives. The number of examinees and the ratio of CR/MC items did not affect the RMSE, SD, or Bias of objective scores.

Table 3 shows the maximum and minimum RMSE values for each method. Using the methods other than the proportion-correct method reduced the RMSE by as much as 0.055, which is 33 percent of the proportion-correct RMSE.


Table 3. The Maximum and Minimum RMSE Values for Each Method

  Method   Wainer/Shin   Yen    Bock    Proportion-correct
  Max          0.11      0.12   0.11          0.165
  Min          0.08      0.09   0.075         0.13

From the results of this study, it can be seen that the Wainer and Shin methods yield objective scores with the highest reliability; theoretically, these two methods may therefore be considered the better methods, and it is worth comparing them in more detail. The two methods perform similarly in reliability, 95CI width, and percent coverage of the 95CI. They differ in three ways: (1) the capability to compute the conditional standard error of measurement (CSEM), (2) the capability to improve the estimation as more prior information becomes available, and (3) the complexity of computation. From Equations (19), (27), and (28), it can be seen that the computation of the CSEM is possible for the Shin method but not for the Wainer method; for the Wainer method, only an overall SEM for the whole group can be obtained.

As for the capability to improve the estimation as more prior information becomes available, the results show that the number of examinees (within the scope of the present study) did not affect the estimation. That means that additional student data from a single administration may not be able to improve the estimation. However, if a state testing program runs for more than one year, data from the previous year can be used as the prior to reduce the size of the credibility intervals when estimating objective scores. The Bayesian approach, which is the basis of the Shin method, lends itself naturally to sequential algorithms: the posterior after the ith data point becomes the prior for the (i+1)th data point, and so on.
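As an illustration of this sequential idea only (the Shin method itself is fit by MCMC; the conjugate normal-normal shortcut below, and all names and values in it, are hypothetical):

    # Minimal sketch of sequential Bayesian updating: each year's posterior
    # for an objective's mean score becomes the next year's prior.
    def update_normal_prior(prior_mean, prior_var, data_mean, data_var, n):
        """Posterior for a normal mean given n observations with known variance."""
        post_var = 1.0 / (1.0 / prior_var + n / data_var)
        post_mean = post_var * (prior_mean / prior_var + n * data_mean / data_var)
        return post_mean, post_var

    mean, var = 0.5, 1.0                     # diffuse year-1 prior on the mean
    yearly_data = [(0.62, 0.04, 5000),       # (mean, variance, n) per year;
                   (0.60, 0.04, 5200),       # hypothetical values
                   (0.63, 0.04, 4900)]
    for year, (m, v, n) in enumerate(yearly_data, start=1):
        mean, var = update_normal_prior(mean, var, m, v, n)
        print(f"Year {year}: posterior mean = {mean:.4f}, variance = {var:.2e}")

The posterior variance shrinks each year as data accumulate, which is the mechanism by which a multi-year prior can tighten the credibility intervals.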

For example, Figure 13 shows the effect of the sequential algorithm over 5 years (i.e., using the data of year 1 as the prior for year 2, year 2 for year 3, etc.). In Figure 13, the line labeled rmse-0 represents the root mean square error (RMSE) for the test with 0% CR items, and rmse-5 represents the RMSE for the test with a CR/MC ratio of 50%. The x-axis represents the year and the y-axis the RMSE on the proportion-correct scale (that is, a value of 0.12 means 12%). It can be seen from Figure 13 that as the prior changed from year 1 to later years, the RMSE based on the Shin method decreased until year four; after that, the RMSE remained the same. This result implies that if more information can contribute to the specification of the prior distribution, the Shin method may perform better than the Wainer method.

Figure 13. The root mean square error of objective scores

[Figure: line plot of RMSE on the proportion-correct scale (y-axis, 0.05 to 0.12) against year (x-axis, 0 to 6), with two lines: rmse-0 (0% CR items) and rmse-5 (CR/MC ratio of 50%).]


The price of the Shin method is its computational complexity. Although the model used in this method is relatively simple, conducting MCMC sampling requires more computation time than the other methods. For a test with 20,000 students and 7 objectives, it would take approximately 18 minutes to compute the objective scores with the Shin method. One way to overcome this may be to compute the quantities in Equations (29), (31), and (32) from past years' data and then use them to estimate the weights in Equations (27) and (28) for the computation of the objective score. In this way, the estimation time for the objective score would be greatly reduced. Further research on the accuracy of this approach is needed.
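Since Equations (27) through (32) are not reproduced in this section, the following is only a generic illustration of the idea of precomputed weights, using a Kelley-type regressed estimate (Kelley, 1927) rather than the Shin method's actual weighting; the weight and values are hypothetical:

    # Illustration only: once the weight is precomputed (e.g., from a previous
    # year's data), scoring reduces to a cheap weighted average rather than MCMC.
    def regressed_objective_score(observed, group_mean, weight):
        """Shrink an observed proportion-correct toward the group mean."""
        return weight * observed + (1.0 - weight) * group_mean

    weight, group_mean = 0.72, 0.65   # hypothetical precomputed values
    print(regressed_objective_score(0.50, group_mean, weight))  # about 0.542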

Conclusion, Limitations, and Future Study

From the results of this study, using collateral test information to estimate objective scores did increase the reliability and reduce the width of the 95CI compared to the proportion-correct method. Among the methods studied, the Wainer and Shin methods yielded objective scores that had the highest reliability, and the Yen method had the narrowest 95CI and relatively accurate percent coverage of the 95CI. The factors that affected reliability were the number of items per objective, the correlation between objectives, and the CR/MC ratio; as the values of these factors increased, the reliability of objective scores also increased. Only the number of items per objective affected the width of the 95CI and its percentage coverage: the more items per objective, the narrower the 95CI. Using collateral test information to estimate objective scores also reduced the RMSE compared to the proportion-correct method; among the methods studied, the Wainer and Shin methods yielded objective scores with the lowest RMSE.

The Bock method performed best when the correlation between objectives was equal to one. The factors that affected the Bias, SD, and RMSE were the number of

items per objective and the correlation between objectives. The more items per objective

or the higher the correlation between objectives, the smaller the Bias, SD, and RMSE

were.

In the present study, only simulated data were used to compare the performance of the different methods; therefore, the results will generalize best to situations similar to those studied in this paper. Because empirical settings may involve conditions such as skewed student ability distributions or extreme item parameters, more studies using empirical data will be needed. In the simulation process, only fixed correlations between objectives and fixed numbers of score categories were used; in practice, these variables should be studied under more varied and complex conditions, and a more complex simulation design may be beneficial for future work. Because some state testing programs use different IRT models, such as the Rasch model for MC items and the Partial Credit model for polytomous items, simulations under these models may be needed for those programs. The MCMC method is still time-consuming compared to the other methods.

Part of the reason is that the available software, WinBUGS, is not optimized for the method used in this study; if the MCMC method is used operationally, a more efficient programming language could be used. From the results of this study, the 95CIs for the Wainer and Shin methods are too conservative, so a future study to adjust their 95CIs would be beneficial. Under the policy of NCLB, the growth of students' overall achievement is a prominent topic in educational research, but little attention has been paid to growth at the subscale level. Therefore, once a reliable estimate of the objective score is available, the next step should be to consider how to link the objective scores from year to year, or even to create a vertical scale at the subscale level for measuring growth. Research on objective scores seems important, and many unanswered questions remain to be studied.


References

Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34, 197-211.

Gessaroli, M. E. (2004). Using hierarchical multidimensional item response theory to

estimate augmented subscores. Paper presented at the annual meeting of the National

Council on Measurement in Education, San Diego, CA.

Kahraman, N., & Kamata, A. (2004). Increasing the precision of subscale scores by using out-of-scale information. Applied Psychological Measurement, 28(6), 407-426.

Kolen, M. J., & Brennan, R. L. (2004). Test equating: Methods and practices (2nd ed.). New York: Springer-Verlag.

Kelley, T. L. (1927). The interpretation of educational measurements. New York: World

Book.

Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342-366.

Pommerich, M., Nicewander, W. A., & Hanson, B. (1999). Estimating average domain scores. Journal of Educational Measurement, 36, 199-216.

Shin, C. D., Ansley, T., Tsai, T., & Mao, X. (2005). A comparison of methods of estimating objective scores. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.

Tate, R. L. (2004). Implications of multidimensionality for total score and subscore performance. Applied Measurement in Education, 17(2), 89-112.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., III, Rosa, K., Nelson, L., Swygert, K. A., & Thissen, D. (2000). Augmented scores—"borrowing strength" to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 343-387). Hillsdale, NJ: Lawrence Erlbaum Associates.

Yen, W. M. (1987). A Bayesian/IRT index of objective performance. Paper presented at the annual meeting of the Psychometric Society, Montreal, Quebec, Canada, June 1-19.

Yen, W. M., Sykes, R. C., Ito, K., & Julian, M. (1997). A Bayesian/IRT index of objective performance for tests with mixed-item types. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.