Combining Scores Across Multi-Item Scales



How to combine scores across multiple questions to form a total scale score (modified

and shortened, from Chapter 19, Warner, 2007)

19.6 Methods for the Computation of Summated Scales

19.6.1 Implicit Assumption: All Items Measure the Same Construct and Are Scored in

the Same Direction

When we add together scores on a list of measures or questions, we implicitly assume that

all these scores measure the same underlying construct and that all the questions or items are

scored in the same direction.

Consider the first assumption, the assumption that all items measure the same construct.

What information would be obtained by a set of numbers that measured completely unrelated

things? A sum of X1 = height, X2 = agreeableness, and X3 = number of pairs of shoes owned

by a person would be a meaningless number because the scores that are combined are not

measures of the same underlying latent variable. In general, it does not make sense to

summarize information across a set of X measured variables by summing them unless they

are highly correlated with each other, and both the pattern of correlations and the nature of

the items are consistent with the interpretation that all the individual X items are slightly

different ways of measuring the same underlying latent variable (e.g., depression).

The items included in psychological tests such as the CESD scale are typically written so

that they assess slightly different aspects of a complex variable such as depression (e.g., low

self-esteem, fatigue, sadness). To evaluate empirically whether these items can reasonably be

interpreted as measures of the same underlying latent variable or construct, we look for

reasonably large correlations among the scores on the items. If the scores on a set of

measurements or test items are highly correlated with each other, this evidence is consistent

with the belief that the items may all be measures of the same underlying construct. However,


high correlations among items can arise for other reasons and are not necessarily proof that

the items measure the same underlying construct; for example, they may occur due to

sampling error or may arise because the items have some kind of measurement artifact in

common, such as a strong social desirability bias. The most widely reported method of

evaluating reliability for summated scales, Cronbach’s alpha, is based on the mean of the

inter-item correlations.

19.6.2 Reverse-Worded Questions

Consider the second assumption: the assumption that all items are scored in the same

direction. In the CESD scale in the appendix to this chapter, most of the items are worded in

such a way that a higher score indicates a greater degree of depression. For example, for

Question 3, “I felt that I could not shake off the blues even with help from my family or

friends,” the response that corresponds to 4 points (“I felt this way most of the time, 5–7 days

per week”) indicates a higher level of depression than the response that corresponds to 1

point (“I felt this way rarely or none of the time”). However, a few of the items (Numbers 4,

8, 12, and 16) are reverse worded. Question 4 asks how frequently the respondent “felt that I

was just as good as other people.” The response to this question that would indicate the

highest level of depression corresponds to the lowest frequency of occurrence (1 = rarely or

none of the time). When reverse-worded items are included in a multiple-item measure, the

scoring on these items must be recoded before we sum scores across items, such that a high

score on every item corresponds to the same thing, that is, a higher level of depression.

When I name my SPSS variables, I generally give names that help me to remember what

scale each question belongs to, which item number, and whether or not it is reverse scored.

So, for example, when my survey included the 20-item (question) CESD scale, I named the

items dep1, dep2, dep3, etc. However, when a question is reverse worded and needs to be


recoded before it is used in a reliability analysis (such as Cronbach’s alpha) or summed

with other items, I initially give the variable a name such as “revdep4”.

When self-report methods are used, it is often desirable to include some reverse-worded

questions. Self-report responses are prone to many types of bias, including yea-saying or

nay-saying bias (some respondents tend to agree or disagree with all items), and social

desirability bias (many people tend to report behaviors and attitudes that they believe are

socially desirable); see Converse and Presser (1999) for further discussion. To avoid the yea-

saying bias, some scales include reverse-worded items. For example, the CESD scale

includes statements about feelings and behaviors, and respondents are asked to rate how

frequently they experience each of these, using a scale from 1 (rarely or none of the time, less

than 1 day a week) to 4 (most or all of the time, 5–7 days a week).

It is generally preferable to report final scores for a scale scored in a direction such that a

higher score corresponds to “more” of the attitude or ability that the test is supposed to

measure. For example, it is easier to talk about scores on a depression scale, and to interpret

correlations of the depression scale with other variables, if a higher score corresponds to

more severe depression. (If a depression scale was scored such that a high score corresponded

to a low level of depression, then scores on the depression scale would correlate negatively

with other measures of negative mood such as anxiety; this would be confusing for the data

analyst and the reader.) Most of the items on the CESD scale are worded such that a high

frequency of reported occurrence corresponds to a higher level of depression. For example, a

high reported frequency of occurrence for the item “I had crying spells” corresponds to a

higher level of depression. However, a few of the CESD scale items were reverse worded, for

example, “I enjoyed life.” For these reverse-worded items, a score of 1 or 2 indicating a low

frequency of occurrence corresponds to a higher level of depression. Before combining


scores across items that are worded in different directions (such that for some items, a high

score corresponds to more depression, and for other items, a low score corresponds to more

depression), it is necessary to recode the direction of scoring on reverse-worded items so that

a higher score always corresponds to a higher level of depression. Items 4, 8, 12, and 16 in

the appendix were reverse worded. Scores on these reverse-worded items must be recoded

when we form a sum of the scores across all 20 items to serve as an overall measure of

depression.

In the following example, revdep4 is the name of the SPSS variable that corresponds to

the reverse-worded depression item “I felt that I was just as good as other people” (item

number 4 on the CESD scale). One simple method to reverse the scoring on this item (so that

a higher score corresponds to more depression) is as follows: Create a new variable (dep4)

that corresponds to 5 − revdep4. If you take a value that is one unit higher than the highest

possible score on a measure (in this case, because the possible scores are 1, 2, 3, and 4, we

use the value 5), and then subtract each person’s score from that reference value, this reverses

the direction of scoring. This can be done in SPSS by making the following menu selections:

<Transform> <Compute>.

In the dialog box for the Compute procedure (see Figure 19.5), the name of the new

variable or Target Variable (dep4) is placed in the left-hand side box. The equation to

compute a score for this new variable as a function of the score on an existing variable is

placed in the right-hand side box titled Numeric Expression (in this case, the numeric

expression is 5 − revdep4).

Insert Figure 19.5

Figure 19.5 Computing a Reverse-Scored Variable for Dep4
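Although the chapter carries out this step in SPSS, the subtract-from-5 arithmetic is easy to sketch in a few lines of Python; the function name and the response values below are invented for illustration:

```python
def reverse_score(responses, max_score=4, min_score=1):
    """Reverse-score an item by subtracting each response from
    (max_score + min_score); for a 1-4 item this is 5, so
    1 -> 4, 2 -> 3, 3 -> 2, and 4 -> 1."""
    return [(max_score + min_score) - r for r in responses]

revdep4 = [1, 4, 2, 3]          # raw responses to the reverse-worded item
dep4 = reverse_score(revdep4)   # [4, 1, 3, 2]
```

Applying the function twice returns the original scores, which is a quick way to check that the recoding rule is symmetric.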

It is also helpful to create a variable with a different name for the reverse-coded score for


each item (e.g., dep4 is the reverse-coded score on revdep4). If you change the direction of

scoring by changing the original values and retain the original variable name (as in this

example, dep4 = 5 − dep4), it is easy to lose track of which items have and have not already

been reverse scored.

Based on a preliminary examination of the data, the researcher evaluates whether the two

assumptions required for simple summated scales are satisfied (i.e., scores on all the items are

positively intercorrelated, and it makes sense to interpret all the items as measures of the

same underlying construct or variable).

19.6.3 Sum of Raw Scores

After recoding any reverse-worded items, you can create a total score for each scale by

summing scores across items as shown in Figure 19.7. In this first example, a score for

selected items from the CESD scale was computed by summing the scores on Items 1

through 5 (with Item 4 reverse scored). The <Transform> and <Compute> menu selections

open the SPSS Compute dialog window that appears in Figure 19.7. The name of the new

variable (in this example, briefcesd) is placed in the left-hand side window under the Target

Variable. The equation that specifies which scores are summed is placed in the Numeric

Expression window. To form a score that is the sum of items named dep1 to dep5 (but using

dep4 instead of revdep4), you can use the following numeric expression:

briefcesd = dep1 + dep2 + dep3 + dep4 + dep5. (19.6)

Insert Figure 19.7

Figure 19.7 Computation of a Brief 5-Item Version of the Depression Scale

If an individual has a missing score on one or more individual items, use of this

computation (briefcesd = dep1 + dep2 + dep3 + dep4 + dep5) will result in a system missing

code for the new scale total score. In this dataset, one participant had a system missing code


on revdep4 and dep4; therefore, the number of scores is reduced from N = 98 in the entire

SPSS data file to N = 97 for analyses that involve the variable briefcesd. If you want to

obtain a score for people who have missing values on some items, you can use the “MEAN”

function in the SPSS Compute dialog window (see Figure 19.8); this returns the mean score,

based on all non-missing items. For example, if a person is missing a score on dep2, the

numeric expression mean(dep1, dep2, dep3, dep4, dep5) will return the mean for all

available scores on Items dep1, dep3, dep4, and dep5. If you want to put the total score back

into the units that you would have obtained by summing items, multiply this mean by the

number of items in the scale (in this case, the number of items was 5).
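The two scoring options just described, a plain sum (which goes missing whenever any item is missing) and the mean of the available items rescaled by the number of items, can be sketched as follows; using None for a missing response is an illustrative convention, not SPSS's missing-value system:

```python
def sum_score(items):
    """Plain sum; returns None (a missing score) if any item is missing."""
    if any(v is None for v in items):
        return None
    return sum(items)

def mean_based_score(items):
    """Mean of the non-missing items, multiplied by the number of items
    to put the score back into sum-of-items units."""
    valid = [v for v in items if v is not None]
    if not valid:
        return None
    return sum(valid) / len(valid) * len(items)

responses = [3, None, 2, 4, 1]   # this respondent skipped the second item
sum_score(responses)             # None: the total goes missing
mean_based_score(responses)      # (3 + 2 + 4 + 1) / 4 * 5 = 12.5
```

Note that the mean-based score implicitly assumes the skipped item would have behaved like the items the person did answer, which is why it is usually reserved for respondents missing only one or two items.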

19.6.4 Sum of z Scores

Summing raw scores may be reasonable when the items are all scored using the same

response alternatives or all measured in the same units. However, there are occasions when

researchers want to combine information across variables that are measured in quite different

units. Suppose a sociologist wants to create an overall index of socioeconomic status (SES)

by combining information about the following measures: annual income in dollars, years of

education, and occupational prestige rated on a scale from 0 to 100. If raw scores (in dollars,

years, and points) were summed, the value of the total score would be dominated by the value

of annual income. If we want to give these three factors (income, education, and occupational

prestige) equal weight when we combine them, we can convert each variable to a z score or

standard score and, then, form a unit-weighted composite of these z scores:

ztotal = zX1 + zX2 + … + zXp. (19.7)

To create a composite of z scores on income, education, and occupational prestige so as to

summarize information about SES, you could compute SES = zincome + zeducation + zoccupationprestige.

You could also use the Mean function to obtain a mean of z scores for the items in a scale.
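A sketch of this unit-weighted z-score composite, applied to the hypothetical SES example; all the data values below are invented for illustration:

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to z scores using the sample mean and SD."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

# Invented values for five people; the three variables are measured in
# very different units (dollars, years, prestige points).
income    = [30000, 45000, 60000, 80000, 120000]
education = [10, 12, 14, 16, 20]
prestige  = [25, 40, 50, 65, 85]

# After standardizing, each variable contributes equally to the sum,
# so income no longer dominates the composite.
ses = [zi + ze + zp for zi, ze, zp in
       zip(z_scores(income), z_scores(education), z_scores(prestige))]
```

Because each z-scored variable has a mean of 0 and a standard deviation of 1, the composite weights the three components equally regardless of their raw scales.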


19.7 Assessment of Internal Homogeneity for Multiple-Item Measures

The internal consistency reliability of a multiple-item scale tells us the degree to which the

items on the scale measure the same thing. If the items on a test all measure the same

underlying construct or variable, and if all items are scored in the same direction, then the

correlations among all the items should be positive.

19.7.2 Cronbach’s Alpha Reliability Coefficient: Conceptual Basis

We can summarize information about positive intercorrelations between the items on a

multiple-item test by calculating a Cronbach’s alpha reliability. The Cronbach’s alpha has

become the most popular form of reliability assessment for multiple-item scales. As seen in

an earlier section, as we sum a larger number of items for each participant, the expected value

of ei approaches 0, while the value of p × T increases. In theory, as the number of items (p)

included in a scale increases, assuming other characteristics of the data remain the same, the

reliability of the measure (the size of the p × T component compared with the size of the e

component) also increases. The Cronbach’s alpha provides a reliability coefficient that tells

us, in theory, how reliable our estimate of the “stable” entity that we are trying to measure is,

when we combine scores from p test items (or behaviors or ratings by judges). The

Cronbach’s alpha uses the mean of all the inter-item correlations (for all pairs of items or

measures) to assess the stability or consistency of measurement.

The Cronbach’s alpha can be understood as a generalization of the Spearman-Brown

prophecy formula; we calculate the mean inter-item correlation (r̄) to assess the degree of

agreement among individual test items, and then, we predict the reliability coefficient for a p-

item test from the correlations among all these single-item measures. Another possible

interpretation of the Cronbach’s alpha is that it is, essentially, the average of all possible split

half reliabilities. Here is one formula for the Cronbach’s alpha from Carmines and Zeller (1979,


p. 44):

α = p r̄ / [1 + (p − 1) r̄], (19.11)

where p is the number of items on the test and r̄ is the mean of the inter-item correlations.

The size of the Cronbach’s alpha depends on the following two factors:

As p (the number of items included in the composite scale) increases, and assuming that

r̄ stays the same, the value of the Cronbach’s alpha increases.

As r̄ (the mean of the correlations among items or measures) increases, assuming that

the number of items p remains the same, the Cronbach’s alpha increases.
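Both relationships can be seen by plugging numbers into Equation 19.11; a small sketch with illustrative values:

```python
def standardized_alpha(p, r_bar):
    """Cronbach's alpha from Equation 19.11: p is the number of items
    and r_bar is the mean inter-item correlation."""
    return p * r_bar / (1 + (p - 1) * r_bar)

# Holding r_bar at .30 and doubling the number of items raises alpha:
standardized_alpha(5, 0.30)    # about .68
standardized_alpha(10, 0.30)   # about .81

# Holding p at 5 and raising the mean correlation also raises alpha:
standardized_alpha(5, 0.50)    # about .83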

It follows that we can increase the reliability of a scale by adding more items (but only if

doing so does not decrease r̄, the mean inter-item correlation) or by modifying items to

increase r̄ (either by dropping items with low item-total correlations or by writing new items

that correlate highly with existing items). There is a trade-off: If the inter-item correlation is

high, we may be able to construct a reasonably reliable scale with few items, and of course, a

brief scale is less costly to use and less cumbersome to administer than a long scale. Note that

all items must be scored in the same direction prior to summing. Items that are scored in the

opposite direction relative to other items on the scale would have negative correlations with

other items, and this would reduce the magnitude of the mean inter-item correlation.

Researchers usually hope to be able to construct a reasonably reliable scale that does not

have an excessively large number of items. Many published measures of attitudes or

personality traits include between 4 and 20 items for each trait. Ability or achievement tests

(such as IQ) may require much larger numbers of measurements to produce reliable results.

Note that when the items are all dichotomous (such as true/false), the Cronbach’s alpha

may still be used to assess the homogeneity of response across items. In this situation, it is


sometimes called a Kuder-Richardson 20 (KR-20) reliability coefficient. However, the

Cronbach’s alpha is not appropriate for use with items that have categorical responses with

more than two categories.
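Alpha can also be computed directly from item data using its standard variance form, α = [p/(p − 1)] × (1 − Σ item variances / variance of the total score); applied to dichotomous 0/1 items, the same computation yields the KR-20 value. A sketch with invented data:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of per-item score lists (one inner
    list per item, all items already scored in the same direction).
    Variance form: (p / (p - 1)) * (1 - sum of item variances /
    variance of the total score)."""
    p = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_vars = sum(pvariance(item) for item in item_scores)
    return p / (p - 1) * (1 - item_vars / pvariance(totals))

# Three invented true/false (0/1) items for five respondents; here
# cronbach_alpha() returns what would be reported as the KR-20.
items = [[1, 1, 0, 1, 0],
         [1, 0, 0, 1, 0],
         [1, 1, 0, 1, 1]]
kr20 = cronbach_alpha(items)
```

This variance form and the mean-correlation form of Equation 19.11 agree exactly when all items have equal variances; with unequal item variances they correspond to the raw-score and standardized versions of alpha discussed later in this chapter.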

19.7.3 Cronbach’s Alpha for Five Selected CESD Scale Items

Ninety-seven students filled out the 20-item CESD scale (items shown in the appendix to

this chapter) as part of a survey. The names given to these 20 items in the SPSS data file that

appears in Table 19.2 were dep1 to dep20. Questions 4, 8, 12, and 16 were reverse worded,

and therefore, it was necessary to recode the scores on these items. The recoded values were

placed in variables with the names dep4, dep8, dep12, and dep16. The SPSS reliability

procedure was used to assess the internal consistency reliability of their responses. The value

of the Cronbach’s alpha is an index of the internal consistency reliability of the depression

score formed by summing the first 5 items. In this first example, only the first 5 items (dep1,

dep2, dep3, dep4, and dep5) were included. To run SPSS reliability, the following menu

selections were made, starting from the top level menu for the SPSS data worksheet (see

Figure 19.11): <Analyze> <Scale> <Reliability>.

The reliability procedure dialog box appears in Figure 19.12. The names of the 5 items on

the CESD scale were moved into the variable list for this procedure. The Statistics button was

clicked to request additional output; the Reliability Analysis: Statistics window appears in

Figure 19.13. In this example, “Scale if item deleted” in the “Descriptives for” box and

“Correlations” in the “Inter-Item” box were checked. The syntax for this procedure appears in

Figure 19.14, and the output appears in Figure 19.15.

Insert Figure 19.11

Figure 19.11 SPSS Menu Selections for the Reliability Procedure

Insert Figure 19.12


Figure 19.12 SPSS Reliability Analysis for 5 CESD Scale Items: Dep1, Dep2, Dep3,

Dep4, and Dep5

Insert Figure 19.13

Figure 19.13 Statistics Selected for SPSS Reliability Analysis

Insert Figure 19.14

Figure 19.14 SPSS Syntax for Reliability Analysis

Insert Figure 19.15

Figure 19.15 SPSS Output From the First Reliability Procedure

NOTE: Scale: BriefCESD.

The Reliability Statistics panel in Figure 19.15 reports two versions of the Cronbach’s

alpha statistic for the entire scale including all 5 items. For the sum dep1 + dep2 + dep3 +

dep4 + dep5, the Cronbach’s alpha estimates the proportion of the variance in this total that is

due to p × T, the part of the score that is stable or consistent for each participant across all 5

items. A score can be formed by summing raw scores (the sum of dep1, dep2, dep3, dep4,

and dep5) or z scores, that is, standardized scores (zdep1 + zdep2 + … + zdep5). The first value, .59, is

the reliability for the scale formed by summing raw scores; the second value, .61, is the

reliability for the scale formed by summing z scores across items. In this example, these two

versions of the Cronbach’s alpha (raw score and standardized score) are nearly identical.

They generally differ from each other more when the items that are included in the sum are

measured using different scales with different variances (as in the earlier example of an SES

scale based on a sum of income, occupational prestige, and years of education).

Recall that the Cronbach’s alpha, like other reliability coefficients, can be interpreted as a

proportion of variance. Approximately 60% of the variance in the total score for depression,

which is obtained by summing the z scores on Items 1 through 5 from the CESD scale, is


shared across these 5 items. A Cronbach’s alpha reliability coefficient of .61 would be considered

unacceptably poor reliability in most research situations. Subsequent sections describe two

different things researchers can do that may improve the Cronbach’s alpha reliability:

deleting poor items or increasing the number of items.

A correlation matrix appears under the heading “Inter-Item Correlation Matrix.” This

reports the correlations between all possible pairs of items. If all items measure the same

underlying construct, and if all items are scored in the same direction, then all the correlations

in this matrix should be positive and reasonably large. Note that the same item that had a

small loading on the depression factor in the preceding factor analysis (trouble concentrating) also

tended to have low or even negative correlations with the other 4 items. The Item-Total

Statistics table shows how the statistics associated with the scale formed by summing all five

items would change if each individual item were deleted from the scale. The Corrected Item-

Total Correlation for each item is its correlation with the sum of the other 4 items in the scale;

for example, for dep1, the correlation of dep1 with the “corrected total” (dep2 + dep3 + dep4 +

dep5) is shown. This total is called “corrected” because the score for dep1 is not included

when we assess how dep1 is related to the total. If an individual item is a “good” measure,

then it should be strongly related to the sum of all other items in the scale; conversely, a low

item-total correlation is evidence that an individual item does not seem to measure the same

construct as other items in the scale. The item that has the lowest item-total correlation with

the other items is, once again, the question about trouble concentrating. This low item-total

correlation is yet another piece of evidence that this item does not seem to measure the “same

thing” as the other 4 items in this scale.

The last column in the Item-Total Statistics table reports Cronbach’s Alpha if Item

Deleted; that is, what is the Cronbach’s alpha for the scale if each individual item is deleted?


For the item that corresponded to the question trouble concentrating, deletion of this item

from the scale would increase the Cronbach’s alpha to .70. Sometimes the deletion of an item

that has low correlations with other items on the scale results in an increase in reliability. In

this example, we can obtain slightly better reliability for the scale if we drop the item trouble

concentrating, which tends to have small correlations with other items on this depression

scale; the sum of the remaining 4 items has a Cronbach’s alpha of .70, which represents slightly

better reliability.
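The two diagnostics used above, the corrected item-total correlation and the alpha-if-item-deleted value, can be sketched as follows; this uses a pure-Python Pearson r and the variance form of alpha, and all names are illustrative rather than SPSS output:

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def cronbach_alpha(item_scores):
    """Variance form of Cronbach's alpha (one inner list per item)."""
    p = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    return p / (p - 1) * (1 - sum(pvariance(i) for i in item_scores)
                          / pvariance(totals))

def item_total_statistics(item_scores):
    """For each item: its correlation with the sum of the OTHER items
    (the corrected item-total r) and alpha with that item deleted."""
    stats = []
    for i, item in enumerate(item_scores):
        rest = [it for j, it in enumerate(item_scores) if j != i]
        rest_total = [sum(person) for person in zip(*rest)]
        stats.append((pearson_r(item, rest_total), cronbach_alpha(rest)))
    return stats
```

An item with a low corrected item-total r and an alpha-if-deleted value above the full-scale alpha is a candidate for removal, as with the trouble-concentrating item in this example.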

19.7.4 Improving Cronbach’s Alpha by Dropping a “Poor” Item

The SPSS reliability procedure was performed on the reduced set of 4 items: dep1, dep2,

dep3, and dep4. The output from this second reliability analysis (in Figure 19.16) shows that

the reduced 4-item scale had Cronbach’s alpha reliabilities of .703 (for the sum of raw scores)

and .712 (for the sum of z scores). A review of the column headed “Cronbach’s Alpha if

Item Deleted” in the new Item-Total Statistics table indicates that the reliability of the scale

would become lower if any additional items were deleted from the scale. Thus, we have

obtained slightly better reliability from the 4-item version of the scale (Figure 19.16) than for

a 5-item version of the scale (Figure 19.15). The 4-item scale had better reliability because

the mean inter-item correlation was higher after the item trouble concentrating was deleted.

Insert Figure 19.16

Figure 19.16 Output for the Second Reliability Analysis: Scale Reduced to Four Items

NOTE: Item trouble concentrating has been dropped.

19.7.5 Improving the Cronbach’s Alpha by Increasing the Number of Items

Other factors being equal, Cronbach’s alpha reliability tends to increase as p, the number

of items in the scale, increases. For example, we obtain a higher Cronbach’s alpha when we

use all 20 items in the full-length CESD scale than when we examine just the first 5 items.


The output from the SPSS reliability procedure for the full 20-item CESD scale (with Items

4, 8, 12, and 16 reverse scored) appears in Figure 19.17. For the full scale formed by

summing scores across all 20 items, the Cronbach’s alpha was .88.

Insert Figure 19.17

Figure 19.17 SPSS Output: Cronbach’s Alpha Reliability for the 20-Item CESD Scale

19.7.6 A Few Other Methods of Reliability Assessment for Multiple-Item Measures

19.7.6.1 Split-Half Reliability

A split-half reliability for a scale with p items is obtained by dividing the items into two

sets (each with p/2 items). This can be done randomly or systematically; for example, the first

set might consist of odd-numbered items and the second set might consist of even-numbered

items. Separate scores are obtained for the sum of the Set 1 items (X1) and the sum of the Set

2 items (X2), and a Pearson r (r12) is calculated between X1 and X2. However, this r12

correlation between X1 and X2 is the reliability for a test with only p/2 items; if we want to

know the reliability for the full test that consists of twice as many items (all p items, in this

example), we can “predict” the reliability of the longer test using the Spearman-Brown

prophecy formula (Carmines & Zeller, 1979):

rXX = 2 r12 / (1 + r12), (19.12)

where r12 is the correlation between the scores based on split-half versions of the test

(each with p/2 items), and rXX is the reliability for a score based on all p items.

Depending on the way in which items are divided into sets, the value of the split-half

reliability can vary. The Cronbach’s alpha can be interpreted as the mean of all possible

different split-half reliabilities.
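The split-half procedure, with the Equation 19.12 step-up, can be sketched as follows; the odd/even split and the data layout (one inner list per item) are illustrative conventions:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman_brown(r12):
    """Equation 19.12: predicted reliability of the full p-item test
    from the correlation r12 between its two p/2-item halves."""
    return 2 * r12 / (1 + r12)

def split_half_reliability(item_scores):
    """Score the odd- and even-numbered items separately, correlate the
    two half scores, then step the correlation up to full length."""
    half1 = [sum(person) for person in zip(*item_scores[0::2])]
    half2 = [sum(person) for person in zip(*item_scores[1::2])]
    return spearman_brown(pearson_r(half1, half2))

# If the two halves correlate .60, the predicted full-length
# reliability is 2(.60) / 1.60 = .75.
```

Because the stepped-up value depends on which items land in which half, different splits give different estimates, which is the variability the text notes just above.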

19.7.6.2 Parallel Forms Reliability

Sometimes it is desirable to have two versions of a test that include different questions


but that yield comparable information; these are called parallel forms. Parallel forms of a test,

such as the Eysenck Personality Inventory, are often designated Form A and Form B. Parallel

forms are particularly useful in repeated measures studies where we would like to test some

ability or attitude on two occasions, but we want to avoid repeating exactly the same

questions. Parallel forms reliability is similar to split-half reliability, except that when

parallel forms are developed, more attention is paid to matching items so that the two forms

contain similar types of questions. For example, consider Eysenck’s Extraversion scale. Both

Form A and Form B include similar numbers of items that assess each aspect of extraversion

—for instance, enjoyment of social gatherings, comfort in talking with strangers, sensation

seeking, and so forth. A Pearson r between scores on Form A and Form B is a typical way of

assessing reliability; in addition, however, a researcher wants scores on Form A and Form B

to yield the same means, variances, and so forth, so these should also be assessed.

19.9 Validity Assessment

Validity of a measurement essentially refers to whether the measurement really measures

what it purports to measure. In psychological and educational measurement, the degree to

which scores on a measure correspond to the underlying construct that the measure is

supposed to assess is called construct validity. (Some textbooks used to list construct

validity as one of several types of measurement validity; in recent years, many authors use

the term construct validity to subsume all the forms of validity assessment described below.)

For some types of measurement (such as direct measurements of simple physical

characteristics), validity is reasonably self-evident. If a researcher uses a tape measure to

obtain information about people’s heights (whether the measurements are reported in

centimeters, inches, feet, or other units), the researcher does not need to go to great lengths to

persuade readers that this type of measurement is valid. However, there are many situations


where the characteristic of interest is not directly observable, and researchers can only obtain

indirect information about it. For example, we cannot directly observe intelligence (or

depression); but we may infer that a person is intelligent (or depressed) if he or she gives

certain types of responses to large numbers of questions that researchers agree are diagnostic

of intelligence (or depression). A similar problem arises in medicine, for example, in the

assessment of blood pressure. Arterial blood pressure could be measured directly by shunting

the blood flow out of the person’s artery through a pressure measurement system, but this

procedure is invasive (and generally, less invasive measures are preferred). The commonly

used method of blood pressure assessment uses an arm cuff; the cuff is inflated until the

pressure in the cuff is high enough to occlude the blood flow; a human listener (or a

microphone attached to a computerized system) listens for sounds in the brachial artery while

the cuff is deflated. At the point when the sounds of blood flow are detectable (the Korotkoff

sounds), the pressure on the arm cuff is read, and this number is used as the index of systolic

blood pressure—that is, the blood pressure at the point in the cardiac cycle when the heart is

pumping blood into the artery. The point of this example is that this common blood pressure

measurement method is quite indirect; research had to be done to establish that measurements

taken in this manner were highly correlated with measurements obtained more directly by

shunting blood from a major artery into a pressure detection system. Similarly, it is possible

to take satellite photographs and use the colors in these images to make inferences about the

type of vegetation on the ground, but it is necessary to do validity studies to demonstrate that

the type of vegetation that is identified using satellite images corresponds to the type of

vegetation that is seen when direct observations are made at ground level.

As these examples illustrate, it is quite common in many fields (such as psychology,

medicine, and natural resources) for researchers to use rather indirect assessment methods—


either because the variable in question cannot be directly observed or because direct

observation would be too invasive or too costly.

In cases such as these, whether the measurements are made through self-report

questionnaires, by human observers, or by automated systems, validity cannot be assumed;

we need to obtain evidence to show that measurements are valid.

For self-report questionnaire measurements, two types of evidence are used to assess

validity. One type of evidence concerns the content of the questionnaire (content or face

validity); the other type of evidence involves correlations of scores on the questionnaire with

other variables (criterion-oriented validity).

19.9.1 Content and Face Validity

Both content and face validity are concerned with the content of the test or survey items.

Content validity involves the question whether test items represent all theoretical dimensions

or content areas. For example, if depression is theoretically defined to include low self-

esteem, feelings of hopelessness, thoughts of suicide, lack of pleasure, and physical

symptoms of fatigue, then a content-valid test of depression should include items that assess

all these symptoms. Content validity may be assessed by mapping out the test contents in a

systematic way and matching them to elements of a theory or by having expert judges decide

whether the content coverage is complete.

A related issue is whether the instrument has face validity; that is, does it appear to

measure what it says it measures? Face validity is sometimes desirable, when it is helpful for

test takers to be able to see the relevance of the measurements to their concerns, as in some

evaluation research studies where participants need to feel that their concerns are being taken

into account.

If a test is an assessment of knowledge (e.g., knowledge about dietary guidelines for


blood glucose management for diabetic patients), then content validity is crucial. Test

questions should be systematically chosen so that they provide reasonably complete coverage

of the information (e.g., What are the desirable goals for the proportions and amounts of

carbohydrate, protein, and fat in each meal? When blood sugar is tested before and after

meals, what ranges of values would be considered normal?).

When a psychological test is intended for use as a clinical diagnosis (of depression, for

instance), clinical source books such as the Diagnostic and Statistical Manual of Mental

Disorders (DSM-IV) might be used to guide item selection, to ensure that all relevant facets

of depression are covered. More generally, a well-developed theory (about ability,

personality, mood, or whatever else is being measured) can help a researcher map out the

domain of behaviors, beliefs, or feelings that questions should cover to have a content-valid

and comprehensive measure.

However, sometimes it is important that test takers not be able to guess the

purpose of the assessment, particularly in situations where participants might be motivated to

“fake good,” “fake bad,” lie, or give deceptive responses. There are two types of

psychological tests that (intentionally) do not have high face validity: projective tests and

empirically keyed objective tests. One well-known example of a projective test is the

Rorschach test, in which people are asked to say what they see when they look at ink blots; a

diagnosis of psychopathology is made if responses are bizarre. Another is the Thematic

Apperception Test, in which people are asked to tell stories in response to ambiguous

pictures; these stories are scored for themes such as need for achievement and need for

affiliation. In projective tests, it is usually not obvious to participants what motives are being

assessed, and because of this, test takers should not be able to engage in impression

management or faking. Thus, projective tests intentionally have low face validity.


Some widely used psychological tests were constructed using empirical keying methods;

that is, test items were chosen because the responses to those questions were empirically

related to a psychiatric diagnosis (such as depression), even though the question did not

appear to have anything to do with depression. For example, persons diagnosed with

depression tend to respond “False” to the MMPI (Minnesota Multiphasic Personality

Inventory) item “I sometimes tease animals”; this item was included in the MMPI depression

scale because the response was (weakly) empirically related to a diagnosis of depression,

although the item does not appear face valid as a question about depression (Wiggins, 1973).

Face validity can be problematic; people do not always agree about what underlying

characteristic(s) a test question measures. Gergen, Hepburn, and Fisher (1986) demonstrated

that when items taken from one psychological test (the Rotter Internal/External Locus of

Control scale) were presented to people out of context and people were asked to say what

trait they thought the questions assessed, they generated a wide variety of responses.

19.9.2 Criterion-Oriented Types of Validity

Content validity and face validity are assessed by looking inside a test to see what

material it contains and what the questions appear to measure. Criterion-oriented validity is

assessed by examining correlations of scores on the test with scores on other variables that

should be related to it if the test really measures what it purports to measure. If the CESD

scale really is a valid measure of depression, for example, scores on this scale should be

correlated with scores on other existing measures of depression that are thought to be valid,

and they should predict behaviors that are known or theorized to be associated with

depression.

19.9.2.1 Convergent Validity

Convergent validity is assessed by checking to see if scores on a new test of some


characteristic X correlate highly with scores on existing tests that are believed to be valid

measures of that same characteristic. For example, do scores on a new brief IQ test correlate

highly with scores on well-established IQ tests such as the WAIS or the Stanford-Binet? Are

scores on the CESD scale closely related to scores on other depression measures such as the

BDI? If a new measure of a construct has reasonably high correlations with existing measures

that are generally viewed as valid, this is evidence of convergent validity.
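As a concrete sketch of this check, the following Python fragment correlates scores on a hypothetical new scale with scores on a hypothetical established measure of the same construct. All names and numbers are invented for illustration, and the cutoff mentioned in the comment is a rough rule of thumb, not a fixed standard.

```python
import numpy as np

# Hypothetical scores for the same six participants on a new scale and on
# an established measure of the same construct (all values invented).
new_scale = np.array([12, 15, 9, 20, 17, 11])
established = np.array([30, 34, 25, 41, 38, 27])

# Pearson correlation between the two sets of scores.
r = np.corrcoef(new_scale, established)[0, 1]

# A high positive r (often above roughly .60, depending on the field)
# would be taken as evidence of convergent validity.
```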

19.9.2.2 Discriminant Validity

Equally important, scores on X should not correlate with things the test is not supposed to

measure (discriminant validity). For instance, researchers sometimes try to demonstrate that

scores on a new test are not contaminated by social desirability bias by showing that these

scores are not significantly correlated with scores on the Marlowe-Crowne Social Desirability

scale or other measures of social desirability bias.

19.9.2.3 Concurrent Validity

As the name suggests, concurrent validity is evaluated by obtaining correlations between

scores on the test with current behaviors or current group memberships. For example, if

persons who are currently clinically diagnosed with depression have higher mean scores on

the CESD scale than persons who are not currently diagnosed with depression, this would be

one type of evidence for concurrent validity.

19.9.2.4 Predictive Validity

Another way of assessing validity is to ask whether scores on the test predict future

behaviors or group membership. For example, are scores on the CESD scale higher for

persons who later commit suicide than for people who do not commit suicide?

19.9.3 Construct Validity: Summary

Many types of evidence (including content, convergent, discriminant, concurrent, and


predictive validity) may be required to establish that a measure has strong construct validity

—that is, that it really measures what the test developer says it measures, and it predicts the

behaviors and group memberships that it should be able to predict. Westen and Rosenthal

(2003) suggested that researchers should compare a matrix of obtained validity coefficients or

correlations with a target matrix of predicted correlations and compute a summary statistic to

describe how well the observed pattern of correlations matches the predicted pattern. This

provides a way of quantifying information about construct validity based on many different

kinds of evidence.
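One greatly simplified way to express this pattern-matching idea is to correlate the vector of predicted validity coefficients with the vector of obtained coefficients. The sketch below uses invented numbers and illustrates only the basic logic; Westen and Rosenthal (2003) proposed more refined summary indices.

```python
import numpy as np

# Predicted (theory-based) and obtained validity coefficients for one
# measure against four criterion variables (all values invented).
predicted = np.array([0.60, 0.40, 0.00, -0.30])
obtained = np.array([0.55, 0.35, 0.10, -0.25])

# A crude summary of construct validity: how closely does the obtained
# pattern of correlations track the predicted pattern?
pattern_match = np.corrcoef(predicted, obtained)[0, 1]
```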

Although the preceding examples have used psychological tests, validity questions

certainly arise in other domains of measurement. For example, referring to the example

discussed earlier, when the colors in satellite images are used to make inferences about the

types and amounts of vegetation on the ground, are those inferences correct? Indirect

assessments are sometimes used because they are less invasive (e.g., as discussed earlier, it is

less invasive to use an inflatable arm cuff to measure blood pressure) and sometimes because

they are less expensive (broad geographical regions can be surveyed more quickly by taking

satellite photographs than by having observers on the ground). Whenever indirect methods of

assessment are used, validity assessment is required.

Multiple-item assessments of some variables (such as depression) may be useful or even

necessary to achieve validity as well as reliability. How can we best combine information

from multiple measures? This brings us back to a theme that has arisen repeatedly throughout

the book; that is, we can often summarize the information in a set of p variables or items by

creating a weighted linear composite or, sometimes, just a unit weight sum of scores for the

set of p variables.

19.10 Typical Scale Development Study


If an existing multiple-item measure is available for the variable of interest, such as

depression, it is usually preferable to employ an existing measure for which we have good

evidence about reliability and validity. However, occasionally, a researcher would like to

develop a measure for some construct that has not been measured before or develop a

different way of measuring a construct for which the existing tests are flawed. An outline of a

typical research process for scale development appears in Figure 19.19. In this section, the

steps included in this diagram are discussed briefly. Although the examples provided involve

self-report questionnaire data, comparable issues are involved in combining physiological

measures or observational data.

Figure 19.19 Possible Steps in the Development of a Multiple-Item Scale

19.10.1 Generating and Modifying the Pool of Items or Measures

When a researcher sets out to develop a measure for a new construct (for which there are

no existing measures) or a different measure in a research domain where other measures have

been developed, the first step is the generation of a pool of “candidate” items. There are many

ways in which this can be done. For example, to develop a set of self-report items to measure

“Machiavellianism” (a cynical, manipulative attitude toward people), Christie and Geis

(1970) drew on the writings of Machiavelli for some items (and also on statements by P. T.

Barnum, another notable cynic). To develop measures of love, Rubin (1970) drew on writings

about love that ranged from the works of classic poets to the lyrics of popular songs. In some

cases, items are borrowed from existing measures; for example, a number of research scales

have used items that are part of the MMPI. However, there are copyright restrictions on the

use of items that are part of published tests.

Brainstorming by experts, and interviews, focus groups, or open-ended questions with


members of the population who are the focus of assessment can also provide useful ideas

about items. For example, to develop a measure of college student life space, including

numbers and types of material possessions, Brackett (2004) interviewed student informants,

visited dormitory rooms, and examined merchandise catalogs popular in that age group.

A theory can be extremely helpful as guidance in initial item development. The early

interview and self-report measures of the global Type A behavior pattern drew on a

developing theory that suggested that persons prone to cardiovascular disease tend to be

competitive, time urgent, job-involved, and hostile. The behaviors that were identified for

coding in the interview thus included interrupting the interviewer and loud or explosive

speech. The self-report items on the Jenkins Activity Survey, a self-report measure of global

Type A behavior, included questions about eating fast, never having time to get a haircut, and

being unwilling to lose in games even when playing checkers with a child (Jenkins, Zyzanski,

& Rosenman, 1979).

It is useful for the researcher to try to anticipate the factors that will emerge when these

items are pretested and FA is performed. If a researcher wants to measure satisfaction with

health care, and the researcher believes that there are three separate components to

satisfaction (evaluation of practitioner competence, satisfaction with rapport or “bedside

manner,” and issues of cost and convenience), then he or she should pause and evaluate

whether the survey includes sufficient items to measure each of these three components.

Keeping in mind that a minimum of 4 to 5 items is generally desired for each factor or scale

and that not all candidate items may turn out to be good measures, it may be helpful to have

something like 8 or 10 candidate items that correspond to each construct or factor that the

researcher wants to measure.

19.10.2 Administer Survey to Participants


The survey containing all the candidate items should be pilot tested on a relatively small

sample of participants; it may be desirable to interview or debrief participants to find out

whether items seemed clear and plausible and whether response alternatives covered all the

options people might want to report. A pilot test can also help the researcher judge how long

it will take for participants to complete the survey. After making any changes judged

necessary based on the initial pilot tests, the survey should be administered to a sample that is

large enough to be used for FA (see Chapter 18 for sample size recommendations). Ideally,

these participants should vary substantially on the characteristics that the scales are supposed

to measure (because a restricted range of scores on T, the component of the X measures that

taps stable individual differences among participants, will lead to lower inter-item

correlations and lower scale reliabilities).

19.10.3 Factor Analyze Items to Assess the Number and Nature of Latent Variables or

Constructs

Using the methods described in Chapter 18, FA can be performed on the scores. If the

number of factors that are obtained and the nature of the factors (i.e., the groups of variables

that have high loadings on each factor) are consistent with the researcher’s expectations, then

the researcher may want to go ahead and form one scale that corresponds to each factor. If the

FA does not turn out as expected, for example, if the number of factors is different from what

was anticipated or if the pattern of variables that load on each factor is not as expected, the

researcher needs to make a decision. If the researcher wants to make the FA more consistent

with a priori theoretical constructs, it may be necessary to go back to Step 1 to revise, add,

and drop items. If the researcher sees patterns in the data that were not anticipated from

theoretical evaluations (but the patterns make sense), he or she may want to use the empirical

factor solution (instead of the original conceptual model) as a basis for grouping items into


scales. Also, if a factor that was not anticipated emerges in the FA, but there are only a few

items to represent that factor, the researcher may want to add or revise items to obtain a better

set of questions for the new factor.

In practice, a researcher may have to go through these first three steps several times; that

is, the researcher may run FA, modify items, gather additional data, and run a new FA several

times until the results of the FA are clear, and the factors correspond to meaningful groups of

items that can be summed to form scales.

Note that some scales are developed based on the predictive utility of items rather than on

the factor structure; for these, DA (rather than FA) might be the data reduction method of

choice. For example, items included in the Jenkins Activity Survey (Jenkins et al., 1979)

were selected because they were useful predictors of a person having a future heart attack.

19.10.4 Development of Summated Scales

After FA (or DA), the researcher may want to form scales by combining scores on

multiple measures or items. There are numerous options at this point.

1. One or several scales may be created (depending on whether the survey or test measures

just one construct or several separate constructs).

2. Composition of scales (i.e., selection of items) may be dictated by conceptual grouping

of items or by empirical groups of items that emerge from FA. In most scale

development research, researchers hope that the items that are grouped to form scales

can be justified both conceptually and empirically.

3. Scales may involve combining raw scores or standardized scores (z scores) on multiple

items. Usually, if the variables use drastically different measurement units (as in the

example above where an SES index was formed by combining income, years of

education, and occupational prestige rating), z scores are used to ensure that each


variable has equal importance.

4. Scales may be based on sums or means of scores across items.
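For option 3 above, the reason for standardizing can be made concrete: when raw units differ wildly, the variable with the largest numbers dominates an unweighted sum. A small Python sketch of the SES-style composite follows; all values are invented.

```python
import numpy as np

# Invented values for four people on three SES indicators measured in
# very different units.
income = np.array([30_000.0, 55_000.0, 90_000.0, 42_000.0])   # dollars
education = np.array([12.0, 16.0, 20.0, 14.0])                # years
prestige = np.array([35.0, 50.0, 70.0, 45.0])                 # rating

def zscores(x):
    """Convert raw scores to z scores (mean 0, standard deviation 1)."""
    return (x - x.mean()) / x.std(ddof=1)

# Summing z scores gives each indicator equal weight; summing raw scores
# would let income swamp the other two variables.
ses_composite = zscores(income) + zscores(education) + zscores(prestige)
```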

19.10.5 Assess Scale Reliability

At a minimum, the internal consistency of each scale is assessed, usually by obtaining a

Cronbach’s alpha. Test-retest reliability should also be assessed if the construct is something

that is expected to remain reasonably stable across time (such as a personality trait), but high

test-retest reliability is not a requirement for measures of things that are expected to be

unstable across time (such as moods).
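Cronbach's alpha itself is simple to compute from the item-score matrix: alpha = (k / (k - 1)) * (1 - sum of the item variances / variance of the total score). A minimal Python version of that formula appears below; it is an illustrative sketch, not the SPSS Reliability procedure used elsewhere in this chapter.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a matrix with one row per respondent and
    one column per item."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of scale total
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

When every item is a perfect copy of every other item, the formula returns 1; weakly related items pull alpha down (and it can even go negative).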

19.10.6 Assess Scale Validity

If there are existing measures of the same theoretical construct, the researcher assesses

convergent validity by checking to see whether scores on the new measure are reasonably

highly correlated with scores on existing measures. If the researcher has defined the construct

as something that should be independent of verbal ability or not influenced by social

desirability, the researcher should assess discriminant validity by making sure that

correlations with measures of verbal ability and social desirability are close to 0. To assess

concurrent and predictive validity, scores on the scale can be used to predict the current or

future group memberships and behaviors that the construct is theorized to predict. For

example, scores on Zick Rubin’s Love Scale (Rubin, 1970) were evaluated to see if they

predicted self-rated likelihood that the relationship would lead to marriage and whether

scores predicted which dating couples would split up and which ones would stay together

within the year or two following the initial survey.

19.10.7 Iterative Process

At any point in this process, if results are not satisfactory, the researcher may “cycle

back” to an earlier point in the process; for example, if the factors that emerge from FA are


not clear or if internal consistency reliability of scales is low, the researcher may want to

generate new items and collect more data. In addition, particularly for scales that will be used

in clinical diagnosis or selection decisions, normative data are required; that is, the mean,

variance, and distribution shape of scores must be evaluated based on a large number of

people (at least several thousand). This provides test users with a basis for evaluation. For

example, for the BDI (Beck et al., 1961), the following interpretations for scores have been

suggested based on normative data for thousands of test takers: scores from 5 to 9, normal

mood variations; 10 to 18, mild to moderate depression; 19 to 29, moderate to severe

depression; and 30 to 63, severe depression. Scores of 4 or below on the BDI may be

interpreted as possible denial of depression or faking good; it is very unusual for people to

have scores that are this low on the BDI.

19.10.8 Create Final Scale

When all the criteria for good quality measurement appear to be satisfied (i.e., the data

analyst has obtained a reasonably brief list of items or measurements that appears to provide

reliable and valid information about the construct of interest), a final version of the scale may

be created. Often such scales are first published as tables or appendixes in journal articles. A

complete report for a newly developed scale should include the instructions for the test

respondents (e.g., what period of time should the test taker think about when reporting

frequency of behaviors or feelings?); a complete list of items, statements, or questions; the

specific response alternatives; indication whether any items need to be reverse coded; and

scoring instructions. Usually, the scoring procedure consists of reversing the direction of

scores for any reverse-worded items and then summing the raw scores across all items for

each scale. If subsequent research provides additional evidence that the scale is reliable and

valid, and if the scale measures something that has a reasonably wide application, at some


point, the test author may copyright the test and perhaps have it distributed on a fee-per-use

basis by a test publishing company. Of course, as years go by, the contents of some test items

may become dated. Therefore, periodic revisions may be required to keep test item wording

current.

19.11 Summary

To summarize, measurements need to be reliable. Unreliable measurements create two

problems. Low reliability may imply that the measure is not valid (if a measure does

not detect anything consistently, it does not make much sense to ask what it is measuring). In

addition, when researchers conduct statistical analyses, such as correlations, to assess how

scores on an X variable are related to scores on other variables, the relationship of X to other

variables becomes weaker as the reliability of X becomes smaller; the attenuation of

correlation due to unreliability of measurement was discussed in Chapter 7. To put it more

plainly, when a researcher has unreliable measures, relationships between variables usually

appear to be weaker. It is also essential for measures to be valid: If a measure is not valid,

then the study does not provide information about the theoretical constructs that are of real

interest. It is also desirable for measures to be sensitive to individual differences, unbiased,

relatively inexpensive, not very invasive, and not highly reactive.
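The attenuation effect mentioned above has a standard closed form in classical test theory: the observed correlation is approximately the true correlation multiplied by the square root of the product of the two reliabilities. A quick numerical illustration (the particular numbers are invented):

```python
import math

def attenuated_r(r_true, reliability_x, reliability_y):
    """Classical attenuation formula: r_observed = r_true * sqrt(rxx * ryy)."""
    return r_true * math.sqrt(reliability_x * reliability_y)

# A true correlation of .50 measured with reliabilities of .70 and .80
# shows up as an observed correlation of only about .37.
observed = attenuated_r(0.50, 0.70, 0.80)
```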

Research methods textbooks point out that each type of measurement method (such as

direct observation of behavior, self-report, physiological or physical measurements, and

archival data) has strengths and weaknesses. For example, self-report is generally low cost,

but such reports may be biased by social desirability (i.e., people report attitudes and

behaviors that they believe are socially desirable, instead of honestly reporting their actual

attitudes and behaviors). When it is possible to do so, a study can be made much stronger by

including multiple types of measurements (this is called “triangulation” of measurement). For


example, if a researcher wants to measure anxiety, it would be desirable to include direct

observation of behavior (e.g., “um”s and “ah”s in speech and rapid blinking), self-report

(answers to questions that ask about subjective anxiety), and physiological measures (such as

heart rates and cortisol levels). If an experimental manipulation has similar effects on anxiety

when it is assessed using behavioral, self-report, and physiological outcomes, the researcher

can be more confident that the outcome of the study is not attributable to a methodological

weakness associated with one form of measurement, such as self-report.

The development of a new measure can require a substantial amount of time and effort. It

is relatively easy to demonstrate reliability for a new measurement, but the evaluation of

validity is far more difficult and the validity of a measure can be a matter of controversy.

When possible, researchers may prefer to use existing measures for which data on reliability

and validity are already available.

For psychological testing, a useful online resource is the American Psychological

Association FAQ on testing: www.apa.org/science/testing.html.

Another useful resource is a directory of published research tests on the Educational

Testing Service (ETS) Test Link site www.ets.org/testcoll/index.html, which has information

on about 20,000 published psychological tests.

Although most of the variables used as examples in this chapter were self-report

measures, the issues discussed in this chapter (concerning reliability, validity, sensitivity,

bias, cost effectiveness, invasiveness, and reactivity) are relevant for other types of data,

including physical measurements, medical tests, and observations of behavior.

Appendix: The CESD Scale

INSTRUCTIONS: Using the scale below, please circle the number before each statement

which best describes how often you felt or behaved this way DURING THE PAST WEEK.


1 Rarely or none of the time (less than 1 day)

2 Some or a little of the time (1–2 days)

3 Occasionally or a moderate amount of time (3–4 days)

4 Most of the time (5–7 days)

The total CESD depression score is the sum of the scores on the following twenty

questions with Items 4, 8, 12, and 16 reverse scored.

1. I was bothered by things that usually don’t bother me.

2. I did not feel like eating; my appetite was poor.

3. I felt that I could not shake off the blues even with help from my family or friends.

4. I felt that I was just as good as other people. (reverse worded)

5. I had trouble keeping my mind on what I was doing.

6. I felt depressed.

7. I felt that everything I did was an effort.

8. I felt hopeful about the future. (reverse worded)

9. I thought my life had been a failure.

10. I felt fearful.

11. My sleep was restless.

12. I was happy. (reverse worded)

13. I talked less than usual.

14. I felt lonely.

15. People were unfriendly.

16. I enjoyed life. (reverse worded)

17. I had crying spells.

18. I felt sad.


19. I felt that people dislike me.

20. I could not get “going.”

A total score on CESD is obtained by reversing the direction of scoring on the four

reverse-worded items (4, 8, 12, and 16), so that a higher score on all items corresponds to a

higher level of depression, and then summing the scores across all 20 items.
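Using the 1-to-4 response coding shown above, this scoring procedure can be sketched in a few lines of Python. This is an illustrative fragment, not official scoring software, and it assumes responses are stored in a dictionary keyed by item number.

```python
REVERSE_WORDED = {4, 8, 12, 16}  # the four reverse-worded CESD items

def cesd_total(responses):
    """Total CESD score from a dict mapping item number (1-20) to a
    response on the 1-4 scale used in this appendix."""
    total = 0
    for item in range(1, 21):
        score = responses[item]
        if item in REVERSE_WORDED:
            score = 5 - score  # flip the scale so 1<->4 and 2<->3
        total += score
    return total
```

With this coding, totals can range from 20 (no reported symptoms) to 80 (all symptoms reported most of the time).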

Appendix Source: Radloff, L. S. (1977). The CESD Scale: A self-report depression scale for

research in the general population. Applied Psychological Measurement, 1, 385–401.

WWW Links: Resources on Psychological Measurement

American Psychological Association www.apa.org/science/testing.html

Goldberg’s International Personality Item Pool—royalty-free versions of scales that

measure “Big Five” personality traits

http://ipip.ori.org/ipip/

Mental Measurements Yearbook Test Reviews online

http://buros.unl.edu/buros/jsp/search.jsp

PsychWeb information on psychological tests

www.psychweb.com/tests/psych_tests


Figure 19.5 Computing a Recoded Variable (Dep4) From the Reverse Scored Item Revdep4


Figure 19.7 Computation of Brief Five-Item Version of Depression Scale: Adding Scores Across Items Using Plus Signs


Figure 19.8 Combining Scores from Five Items Using the SPSS MEAN Function (Multiplied By Number of Items)


Figure 19.11 SPSS Menu Selections for Reliability Procedure


Figure 19.12

SPSS Reliability Analysis for Five CESD Items: Dep1, Dep2, Dep3, Dep4, Dep5

NOTE: Dep4 is the recoded version of revdep4, corrected so that the direction of scoring is the same as for other items on the scale.


Figure 19.13 Statistics Selected for SPSS Reliability Analysis


Figure 19.14 SPSS Syntax for Reliability Analysis


Figure 19.15 SPSS Output From First Reliability Procedure for Scale: Briefcesd

Reliability Statistics

Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
.585               .614                                           5

Inter-Item Correlation Matrix

        dep1    dep2    dep3    dep4    dep5
dep1    1.000   .380    .555    .302    .062
dep2    .380    1.000   .394    .193    .074
dep3    .555    .394    1.000   .446    .115
dep4    .302    .193    .446    1.000   -.129
dep5    .062    .074    .115    -.129   1.000

Item-Total Statistics

        Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
        Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
dep1    5.6701          5.786               .511                .341               .455
dep2    5.7010          5.941               .398                .195               .504
dep3    5.6082          4.845               .615                .434               .365
dep4    7.4742          5.710               .294                .237               .562
dep5    4.8247          7.042               .032                .055               .703


Figure 19.16 Output for the Second Reliability Analysis: Scale Reduced to Four Items

NOTE: dep5, "Trouble Concentrating," has been dropped.

Reliability Statistics

Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
.703               .712                                           4

Inter-Item Correlation Matrix

        dep1    dep2    dep3    dep4
dep1    1.000   .380    .555    .302
dep2    .380    1.000   .394    .193
dep3    .555    .394    1.000   .446
dep4    .302    .193    .446    1.000

Item-Total Statistics

        Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
        Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
dep1    3.1753          4.625               .541                .341               .617
dep2    3.2062          4.811               .407                .194               .686
dep3    3.1134          3.810               .633                .419               .542
dep4    4.9794          4.166               .410                .204               .702


Figure 19.17 SPSS Output: Cronbach Alpha Reliability for 20-Item CES-D Scale

Scale: CESDTotal

Case Processing Summary

                      N     %
Cases   Valid         94    95.9
        Excluded(a)   4     4.1
        Total         98    100.0

a. Listwise deletion based on all variables in the procedure.

Reliability Statistics

Cronbach's Alpha   N of Items
.880               20


Figure 19.19 Possible Steps in the Development of a Multiple-Item Scale.