Test Difficulty and Stereotype Threat on the GRE® General Test

Lawrence J. Stricker Isaac I. Bejar

GRE Board Report No. 96-06R

July 1999

This report presents the findings of a research project funded and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541

********************

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

********************

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

The modernized ETS logo is a trademark of Educational Testing Service.

Educational Testing Service, Princeton, New Jersey 08541

Copyright © 1999 by Educational Testing Service. All rights reserved.

Acknowledgements

The authors wish to thank Margaret L. Redman and Manfred Steffan for assistance in all stages of this project; Brent Bridgeman and Donald A. Rock for advising on experimental design and statistical analysis; Martha L. Stocking for advising on the modification of the computer adaptive test; Karen M. Anselmo, James Briggs, Eileen H. Knott, Robert T. Patrick, Jr., Maria T. Potenza, and Steve Quimby for assisting in the modification of the test; Barbara A. Mecak for modifying the test; Joshua Aronson and Brigitte M. Hammond for constructing or providing word fragment items; Robert Gangi for collecting women stereotype data; Sandra Harris, Catherine Hombo, Akihito Kamata, P. Adam Kelly, Joy Lynn Matthews-Lopez, Kevin Meara, and Eva Ponte for collecting data for the main study; Debra E. Friedman, Regina B. Mercadante, and Susan J. Miller for coding and preparing the data for analysis; Min hwei Wang for doing the computer analysis; and Carol A. Dwyer, Maria T. Potenza, Donald E. Powers, and Gita Z. Wilder for reviewing a draft of this report.

Abstract

Recent research suggests that stereotype threat may adversely affect the performance of Black and female examinees on the Graduate Record Examination (GRE®) General Test, but that this threat may be minimized by using easier test items. The present study investigated the ability to reduce or eliminate stereotype threat by manipulating the difficulty of items administered via a computer-adaptive version of the General Test; the generalizability of these findings for Black examinees as well as women, and for Verbal as well as Quantitative sections of the test; and the processes that may mediate the effects of stereotype threat on test performance. The standard version of the computer-adaptive General Test or a modified version that presents examinees with items that are easier than usual, and a battery of measures of stereotype threat and possible symptoms or consequences of stereotype threat, were administered to college seniors bound for graduate school and to first-year graduate students. Reducing test difficulty did not affect test performance or explicit indexes of stereotype threat for any group, but it lowered the anxiety of White students and women and raised the self-esteem of White students.

Table of Contents

Introduction

Method
    Sample
    Measures
    Procedure
    Analysis

Results and Discussion
    Experimental Manipulation
    Reliability
    Intercorrelations
    Analyses of Covariance

Conclusions

References

Tables

Figures

Appendix

Introduction

Recent experiments by Steele and Aronson (1995) found that telling undergraduates that a test made up of difficult Graduate Record Examination (GRE®) General Test verbal items was diagnostic of their ability or asking them about their ethnicity depressed the test performance of Black students but did not affect White students. Steele and Aronson explain their results as coming about because these experimental manipulations prime Black students’ concerns about fulfilling the negative racial stereotype regarding their intellectual ability; these concerns, in turn, interfere with the students’ performance on the test. (See also Steele, 1997; Aronson, Quinn, & Spencer, 1998.) Based on a series of studies that elicited stereotype threat in a variety of ways for Black students taking verbal tests (Steele & Aronson, 1995) and for women taking quantitative tests (Spencer, Steele, & Quinn, 1997), Steele and his co-workers (Aronson et al., 1998; Spencer et al., 1997; Steele, 1997; Steele & Aronson, 1995) suggest that this phenomenon may help to account for the deficit on standardized tests and in academic performance in school that is observed for Black, female, and other groups of students that are targets of negative stereotypes about their ability.

A new study by Croizet and Claire (1998), which extended this line of work from ethnicity and sex to social class, partially replicated key findings by Steele and Aronson with French undergraduates. In this study, eliciting stereotype threat by describing the verbal test as diagnostic depressed the performance of working class students but not middle class students; eliciting threat by asking the students about their social class did not affect either group. Similarly, another study by Shih, Pittinsky, and Ambady (1999) observed that inquiry about gender-related issues depressed the performance of Asian female high school and college students on a quantitative test. However, two field experiments by Stricker and Ward (1999), using operational tests (AP Calculus AB Examination and College Placement Test Battery) in actual test administrations for high school and college students, found that asking about ethnicity and sex did not affect the test performance of either Black or White, or female or male students.

A noteworthy finding in the Spencer et al. research was that stereotype threat for women appeared to be greater with a difficult test (drawn from the GRE Mathematics Test) than with an easier test (based on the GRE General Test’s Quantitative section). Both men and women had lower performance on the hard test, but women had lower performance than men on the hard test and did not differ from men on the easy test. The patterns were similar for two possible symptoms of stereotype threat: state self-esteem about performance and state anxiety. Women taking the hard test reported less self-esteem and more anxiety than women taking the easy test, whereas men taking the two tests did not differ on these variables. In addition, mirroring the test performance findings, women taking the hard test reported less self-esteem than men, whereas women and men did not differ on the easy test.

In view of the potential bearing of the stereotype threat findings on the General Test and the GRE Board’s concern with equity in assessment, as reflected in the FAME (Fairness, Access, Multiculturalism, and Equity; e.g., Educational Testing Service, 1998) efforts, it is important to investigate the role stereotype threat plays in the test performance of minority, female, and other examinees and to explore steps that might be taken to ameliorate any adverse effects that stereotype threat may have on test performance. In this connection, the Spencer et al. observation that stereotype threat depends on how difficult the test is perceived to be raises the real possibility that stereotype threat can be minimized or even eliminated by manipulating the difficulty of the items. Such a manipulation can be readily accomplished with the current General Test, which is computer-adaptive. Adaptive tests are ordinarily designed to tailor the difficulty of items to the examinee’s level of ability, for psychometrically optimal results, but the difficulty can be tailored to an easier level. The Spencer et al. findings imply that an easier test would protect an examinee from stereotype threat. Making a test easier would necessarily decrease its reliability to some extent because the test is less optimal psychometrically (e.g., Thissen & Mislevy, 1990), but it would enhance the test’s construct validity insofar as irrelevant variance associated with stereotype threat is reduced.

Accordingly, the purpose of this study was to replicate conceptually and extend the Spencer et al. work to the computer-adaptive General Test and the GRE test-taking population. More specifically, the goals were to investigate (a) the ability to reduce or minimize stereotype threat by manipulating test difficulty, (b) the generalizability of these findings from women to Black examinees and from quantitative items to verbal items, and (c) the processes that may mediate the effects of stereotype threat on test performance. The hypothesis was that Black examinees’ performance on the Verbal and Quantitative sections and women’s performance on the Quantitative section would be increased, and both groups’ stereotype threat and its symptoms would be decreased, on an easier version of the General Test.

Method

Sample

The sample consisted of 343 students. They were (a) White and Black, (b) college seniors planning to attend graduate school immediately after graduation and first-year graduate students, (c) enrolled at seven universities (Auburn University, Florida State University, Michigan State University, Ohio University, University of California at Berkeley, University of Florida, and University of Massachusetts), (d) graduates of high schools in the United States, and (e) nonusers of the POWERPREP (Educational Testing Service, 1995) test preparation software for the General Test (the software includes the same form of the test used in this study). The students were paid volunteers recruited on their campuses. They were told that the study concerned possible change in the computerized General Test, the participants were potential or actual General Test examinees, and they would take the test and some questionnaires.

The sample had 170 White and 173 Black examinees; 158 men and 185 women; 223 seniors and 120 graduate students; and 25 humanities, 194 social science, 99 biological and physical science, and 25 not ascertained undergraduate majors or graduate fields.* The number of examinees per university ranged from 35 at the University of California at Berkeley to 61 at Auburn University.

* College seniors were classified on the basis of their undergraduate major, and first-year graduate students on the basis of their graduate field.

Measures

Almost all of the measures in this study were either identical to or very similar to those used by Spencer et al. (1997) and Steele and Aronson (1995). The test performance measures were the General Test scores, Mean Time for Verbal Items, and Mean Time for Quantitative Items. The General Test in this study had the same kind of items as those used by Spencer et al. and Steele and Aronson, but because the easier and standard versions of the test in this study were computer adaptive, scores on the two can be directly compared, for the scores are based on the difficulty and other psychometric characteristics of the items passed, not the number of items passed (e.g., Dorans, 1990). Hence, an examinee should be expected to earn the same score on both versions, unless stereotype threat or other test-taking motivation effects intrude. Spencer et al., in contrast, used easy and hard tests that were not equated, complicating comparisons of scores on the two, for an examinee should ordinarily be expected to achieve a higher score on the easy test.

The stereotype threat measures were the Black Stereotype Activation Scale, the Stereotype Threat Scale, and the Women Stereotype Activation Scale. The latter was not used in the Spencer et al. and Steele and Aronson research.

The stereotype symptom measures were the State Anxiety Scale (Spielberger, 1983), State Self-Esteem Scale (Heatherton & Polivy, 1991), Revised Test Anxiety Scale (Benson & El-Zahhar, 1994), State Metacognitive Inventory (O’Neil & Abedi, 1996), Effort Scale (O’Neil, Sugrue, & Baker, 1995/1996), and Self-Doubt Activation Scale.

The manipulation check measure was the Test Difficulty Rating.

GRE General Test. The computer-adaptive General Test in the POWERPREP test preparation kit was employed. Standard and easier versions of the test’s Verbal and Quantitative sections were used (the Analytical section was eliminated to reduce time and cost).

The easier version was produced by altering the item selection algorithm that chooses items to be administered to the examinee. This algorithm selects items (a) to provide optimal information about the examinee’s ability, partly on the basis of item difficulty; (b) to ensure adequate coverage of the test’s content; and (c) to minimize, for security reasons, the overuse of the same items (e.g., Thissen & Mislevy, 1990). Two changes were made in the algorithm for the easier version. First, the difficulty of the items was reduced by a standard deviation (i.e., a difference of one unit on the Item Response Theory, IRT, b parameter for difficulty, which has a mean of 0 and a standard deviation of 1; e.g., Hambleton, 1989). Separate computer simulations for the Verbal and Quantitative sections, with 1,000 simulated examinees, indicated that this was the maximum reduction in difficulty that could be made while still maintaining a reliability of at least .90 for the section scores, in accord with the GRE Program’s psychometric standards. Second, controls for item overuse were eliminated to increase the available pool of items.
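As a rough illustration of the first change, the sketch below shifts the selection target one unit down the b scale. It assumes a two-parameter logistic model and a pure maximum-information rule; the operational algorithm also weighed content coverage and item exposure and is not reproduced here. The names `pick_item` and `offset` and the toy item pool are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_item(theta_hat, pool, used, offset=0.0):
    """Return the index of the unused item most informative at theta_hat - offset.

    offset=0.0 mimics a standard selection; offset=1.0 targets items about
    one b unit easier (an assumed way of implementing the easier version).
    """
    target = theta_hat - offset
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: information(target, *pool[i]))

# Toy pool of (a, b) pairs with b from -3.0 to 3.0; values are illustrative only.
pool = [(1.0, b / 2.0) for b in range(-6, 7)]

standard = pick_item(0.0, pool, used=set())
easier = pick_item(0.0, pool, used=set(), offset=1.0)
print(pool[standard][1], pool[easier][1])  # 0.0 -1.0
```

For an examinee whose current ability estimate is 0, the standard rule picks the item with b = 0 and the shifted rule picks the item with b = -1.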

The difference in difficulty of the standard and easier versions of the test was assessed by simulations. These analyses, done separately for the Verbal and Quantitative sections, used 1,000 simulated examinees at each of several levels of ability (expressed on the IRT theta scale for ability, which has a mean of 0 and a standard deviation of 1; Hambleton, 1989). At each ability level, the mean difficulty (indexed by the IRT b parameter for difficulty) of the items in their order of administration to the examinees (1...30 for Verbal items, 1...28 for Quantitative items) was calculated for the standard and easier versions. (These statistics are portrayed in the figures in the Appendix.) The extent of separation between the sets of mean item difficulties for the standard and easier versions indicates the success of the difficulty manipulation. For both sections, this separation was clear for examinees representing 95% or more of the ability range (thetas from 4.88 to at least -1.63 for the Verbal section and thetas from at least 1.86 to at least -2.18 for the Quantitative section). The general absence of a difference in item difficulties for examinees at other ability levels and the occasional absence of a difference for examinees with otherwise clear separation of item difficulties reflects the many competing criteria, besides difficulty, that the item selection algorithm attempts to satisfy.
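The kind of simulation described above can be sketched as follows. This is a simplified stand-in, not the program used in the study: it assumes equal item discriminations, a hypothetical evenly spaced pool of 300 items, posterior-mean (EAP) ability estimates, and 50 simulated examinees rather than 1,000 to keep the run short.

```python
import math
import random

def p_correct(theta, b):
    """Logistic probability of a correct response (discrimination fixed at 1)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

GRID = [g / 5.0 for g in range(-20, 21)]  # theta grid from -4.0 to 4.0

def eap(responses):
    """Posterior-mean ability under a N(0, 1) prior; responses = [(b, correct)]."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for b, correct in responses:
            p = p_correct(t, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def administer(true_theta, pool, n_items, offset, rng):
    """Return the b values of the items given to one simulated examinee."""
    used, responses, given = set(), [], []
    for _ in range(n_items):
        target = eap(responses) - offset
        # With equal discriminations, the most informative unused item is
        # the one whose b is closest to the target.
        i = min((j for j in range(len(pool)) if j not in used),
                key=lambda j: abs(pool[j] - target))
        used.add(i)
        correct = rng.random() < p_correct(true_theta, pool[i])
        responses.append((pool[i], correct))
        given.append(pool[i])
    return given

def mean_b_by_position(true_theta, offset, n_examinees=50, n_items=28, seed=0):
    """Mean item difficulty at each position in the order of administration."""
    rng = random.Random(seed)
    pool = [-3.0 + 6.0 * k / 299 for k in range(300)]
    totals = [0.0] * n_items
    for _ in range(n_examinees):
        for pos, b in enumerate(administer(true_theta, pool, n_items, offset, rng)):
            totals[pos] += b
    return [t / n_examinees for t in totals]

standard = mean_b_by_position(0.0, offset=0.0)
easier = mean_b_by_position(0.0, offset=1.0)
gap = [s - e for s, e in zip(standard, easier)]
print(round(sum(gap) / len(gap), 2))
```

Each entry of `gap` is the separation between the two versions at one item position for examinees at a single ability level; repeating the run over several values of `true_theta` would give the family of curves that the Appendix figures portray.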

Other changes were made in both versions of the test to make them as close as possible to the operational test: the test was retitled the “GRE General Test” and the standard General Test instructions were substituted. However, at the end of the test, the reporting of test scores to the examinee was delayed until he or she completed other tasks in the study. In addition, paper-based tests, practice items, and other test preparation material in the POWERPREP kit were dropped.

Four measures were obtained:

a. Scaled score for the Verbal section.

b. Scaled score for the Quantitative section.

c. Mean time for verbal items. This is the mean time (in seconds) taken to answer the verbal items, controlling for the mean difficulty (IRT b parameter) of the items administered.

d. Mean time for quantitative items. This is calculated in the same way as the corresponding variable for verbal items.

Current Thoughts Questionnaire. This questionnaire, its instructions adapted from Heatherton and Polivy (1991; “This questionnaire contains a series of statements about what you are thinking RIGHT NOW....”), had five scales:

a. Performance State Self-Esteem. This is the seven-item Performance subscale from the State Self-Esteem Scale (e.g., “I feel confident about my abilities”). A minor change was made in one item.

b. Social State Self-Esteem. This is the seven-item Social subscale from the State Self-Esteem Scale (e.g., “I am worried about what other people think of me”). Three items were reversed in keying, to minimize acquiescence response style, and a minor change was made in another item.

c. Appearance State Self-Esteem. This is the six-item Appearance subscale from the State Self-Esteem Scale (e.g., “I feel good about myself”). A minor change was made in one item.

d. Total State Self-Esteem. This is the total score for the three state self-esteem scales.

e. State Anxiety. This is the 20-item State Anxiety scale from the State-Trait Anxiety Inventory (e.g., “I am jittery”). A minor change was made in one item.

The five scales on this questionnaire were item analyzed for the total sample. All items had significant (p < .05) positive biserial correlations with the total score on their scale (corrected for item overlap).
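The overlap correction mentioned above can be illustrated by correlating each item with the sum of the remaining items on its scale. The sketch uses an ordinary product-moment correlation in place of the biserial coefficient reported here, and the response matrix is invented for illustration.

```python
import math

def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corrected_item_total(scores):
    """Correlate each item with the total of the other items on the scale.

    scores holds one list of item scores per respondent. Removing the item
    from the total keeps it from being correlated with itself.
    """
    n_items = len(scores[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]
        result.append(pearson(item, rest))
    return result

# Hypothetical responses (5 respondents x 4 items). The fourth item behaves
# like an item that has not yet been reverse-keyed.
scores = [
    [4, 5, 4, 2],
    [2, 1, 2, 4],
    [5, 4, 5, 1],
    [1, 2, 1, 5],
    [3, 3, 4, 3],
]
r = corrected_item_total(scores)
print([round(value, 2) for value in r])
```

A negative value for an item, as for the fourth item here, is the usual sign that its keying needs to be reversed before scoring.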

Test Questionnaire. This questionnaire, its instructions adapted from O’Neil et al. (1995/1996; “This questionnaire contains a series of statements about how you felt or what you did DURING THE EXPERIMENTAL TEST”), had seven scales:

a. Self-Checking. This is a four-item scale made up of items from the Self-Checking scale on the State Metacognitive Inventory (e.g., “I corrected my errors on the test”). Two of the items were reversed in keying.

b. Effort. This is a four-item scale made up of items from the Effort Scale (e.g., “I tried to do my best on the test”). Two of the items were reversed in keying, and one of these items had other minor changes.

c. Worry. This is a four-item scale made up of items, adapted to describe the test in this study (and reversed in keying as needed), from the Worry subscale on the Revised Test Anxiety Scale (e.g., “While taking the test, I often thought about how difficult it was”).

d. Tension. This is a five-item scale adapted from the Tension subscale on the Revised Test Anxiety Scale and, to secure sufficient items, from the Test Anxiety Inventory (Spielberger, 1980) and Reactions to Tests (Sarason, 1984) (e.g., “During the test I felt very tense”).

e. Test-Irrelevant Thinking. This is a four-item scale adapted from the Test-Irrelevant Thinking subscale on the Revised Test Anxiety Scale (e.g., “During the test I sometimes thought about being somewhere else”).

f. Bodily Symptoms. This is a five-item scale adapted from the Bodily Symptoms subscale on the Revised Test Anxiety Scale (e.g., “My mouth felt dry during the test”).

g. Total Test Anxiety Scale. This is the total score for the four test anxiety scales.

The seven scales on this questionnaire were item analyzed. All items had significant positive correlations with the total scores on their scales.

Academic Questionnaire. This questionnaire (its instructions were, “This questionnaire consists of a series of statements about academic matters”) consisted of a ten-item Stereotype Threat Scale, modeled after the eight-item scale used by Steele and Aronson (1995), but with all of the items focusing on others’ perceptions of the ability of people of the examinee’s ethnicity and sex (e.g., “The experimenter expected me to do poorly because of my race or sex”), plus 30 filler items on miscellaneous academic topics. Three of the items on the scale were adapted from the Steele and Aronson scale (the complete scale was unavailable); the ten items were balanced in keying. The other items on the scale and the filler items were original.

This scale was item analyzed. All items had significant positive correlations with the total score.

Word Fragments. This test, modeled after one used by Steele and Aronson (1995), had instructions adapted from the Incomplete Words Test (Ekstrom, French, & Harman, 1976) and Graf and Williams (1987):

This exercise contains a list of word fragments. Look at each word fragment and then fill in the missing letters to complete a word.. . .

Each word fragment can be completed with more than one word. Complete the word fragment with the FIRST word you can find that fits, even if you can think of other words that also fit. There is no right or wrong answer.

Use English words and no proper names (e.g., Lincoln, Korea, Sony).

Work quickly; do not spend too much time on any one word fragment. If you cannot complete a word fragment, skip it and go on to the next one.

The test consisted of three scales (totaling 31 items) and 31 filler items on miscellaneous topics. The filler items were randomly drawn from lists of filler items provided by J. Aronson (personal communication, October 29, 1996) after eliminating items identified as hard by a sample of 6 out of 8 college-educated adults, and with the constraint that the items have approximately the same distribution of the number of blanks and contiguous blanks as the 31 items on the three scales on the test.

The three scales were:

a. Black Stereotype Activation. This is an 11-item scale made up of items from the 12-item scale used by Steele and Aronson (1995) (e.g., CO _ _ _ [COLOR]). (The other item was unavailable.) This scale was item analyzed. Three items had significant positive correlations with the total score, too few to justify a short form of this scale.

b. Women Stereotype Activation. This is a 13-item scale modeled after the Black Stereotype Activation Scale (e.g., GR _ _ _ FUL [GRACEFUL]). A set of 56 words was derived from lists of stereotypes about women identified by Deaux and Lewis (1983) and by Orlofsky (1981): 31 from the Deaux and Lewis list for characteristics rated more applicable to women than men (a mean difference of .5 standard deviations or more) and 25 from the Orlofsky list for items rated as more typical of women (a mean difference with p < .001). These words were converted into word fragments, with approximately the same distribution of blanks as the Black Stereotype Activation Scale items. Four word fragment items (two derived from the Deaux and Lewis list and two from the Orlofsky list) were eliminated because a sample of six out of eight college-educated adults identified them as hard.

The remaining 52 words (29 from the Deaux and Lewis list and 23 from the Orlofsky list) were administered in a questionnaire to 31 graduate students (13 men and 18 women) at the New School University, with these instructions, adapted from Devine and Elliott (1995):

This rating form contains two lists of words that describe stereotypes about women. Please look at the words on each list and check off the ten words on the list that clearly describe the most common stereotypes about women, whether or not you personally believe the stereotypes to be true.

Twenty-five words (13 from the Deaux and Lewis list and 12 from the Orlofsky list) were chosen by the students at greater than chance level.

The 52 word fragment items were administered to 35 undergraduates (4 men and 31 women) in two psychology classes at Rider University with the same instructions as the final test. Fourteen items (six from the Deaux and Lewis list and eight from the Orlofsky list) were chosen because an appreciable number (between approximately 15% and 85%) of the women gave a stereotype response and an appreciable number gave another correct response. One item was subsequently dropped because it was considerably longer than the items on the Black Stereotype Activation Scale. The remaining 13 items were selected for the scale. This process for selecting word fragment items was modeled after the one used by Gilbert and Hixon (1991) to ensure that the word fragments for stereotype words could be completed readily with both stereotyped and nonstereotyped words.

This scale was item analyzed. Three items had significant positive correlations with the total score.

c. Self-Doubt Activation Scale. This is the same 7-item scale used by Steele and Aronson (1995; e.g., DU _ _ [DUMB]). This scale was item analyzed. One item had a significant positive correlation with the total score.

Test Difficulty Rating. This is a ten-point rating of the difficulty of the test, “Taking all things together, how difficult was the experimental test?”, with ratings ranging from 1 (Not at all Difficult) to 10 (Extremely Difficult). This measure is modeled after the 15-point rating used by Steele and Aronson (1995).

Procedure

Students were tested individually on campus by a research assistant. (The seven research assistants, one per university, consisted of three men and four women; two with White, two with Asian, and one each with Black, Hispanic, and American Indian ethnicity.) When the students reported to the laboratory, the assistant asked them to complete an informed consent form and to read a written description of the study:

About the Study

Educational Testing Service is trying out changes in the computerized version of the Graduate Record Examination (GRE) General Test with students who have already taken the GRE test or are planning to take it in the near future. You will be taking an experimental form of this test today, and you will get your scores at the end. These scores should be comparable to those on the regular GRE test. You will also be completing several related questionnaires after the test.

This experimental test may be very different from the computerized or paper-and-pencil GRE test that you may have taken or may know about from test coaching courses or books. The questions may be arranged differently, and they may be easier or harder than the questions in the regular GRE test.

Please try to do your best on this experimental test so that the scores you get are like those on the regular GRE test.

Your scores on the test and the other data will be kept confidential and only seen by the staff of this research project.

If you have any questions, please speak to the graduate research assistant.

Students were then randomly assigned to conditions: those taking the standard version of the General Test (the control group) and those taking the easier version (the experimental group). The students took the test, which was self-administering. After they finished it (but before the scores were reported to them), the students were asked to complete a background information questionnaire (covering sex, ethnicity, educational information, etc.) and then the other measures in this order: Current Thoughts Questionnaire, Test Difficulty Rating, Test Questionnaire, Word Fragments, and Academic Questionnaire. After the students completed these measures, their General Test scores were reported to them.

Analysis

The product-moment intercorrelations of the General Test variables and stereotype threat measures were computed, separately for the experimental and control groups, using a pair-wise missing data program (Word Fragments scales were unavailable for one student). The reliability of the scales on the questionnaires and on Word Fragments was estimated by Coefficient Alpha.
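For reference, Coefficient Alpha for a k-item scale is k/(k - 1) times one minus the ratio of the summed item variances to the variance of the total score. A minimal sketch with made-up responses follows; the function name and data are illustrative and not taken from the study.

```python
from statistics import variance

def coefficient_alpha(scores):
    """Coefficient Alpha for a respondents-by-items list of scores."""
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical responses: 5 respondents x 3 items.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 4],
    [3, 3, 3],
]
alpha = coefficient_alpha(scores)
print(round(alpha, 3))  # 0.934
```

Alpha rises as the items covary more strongly with one another relative to their individual variances, which is why scales whose items barely intercorrelate yield very low values.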

A series of 2 (Experimental vs. Control) x 2 (Ethnicity) x 2 (Sex) factorial analyses of covariance of the two test scores and the 24 other variables were carried out, using the least squares method (Model II error term; Overall & Spiegel, 1969) to deal with unequal Ns. A total of nine covariates were used: a set of six dummy variables for universities and a set of three dummy variables for undergraduate majors or graduate fields. The dummy variables for University of California at Berkeley and not ascertained major/field were excluded to eliminate the dependency among the respective set of dummy variables.
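The covariate coding can be pictured as follows. This is an illustrative sketch only: the helper `dummy_code` and the short field labels are assumptions, while the category lists and the two omitted reference categories follow the description above.

```python
UNIVERSITIES = [
    "Auburn University",
    "Florida State University",
    "Michigan State University",
    "Ohio University",
    "University of California at Berkeley",
    "University of Florida",
    "University of Massachusetts",
]
FIELDS = [
    "humanities",
    "social science",
    "biological and physical science",
    "not ascertained",
]

def dummy_code(value, levels, reference):
    """0/1 indicators for every level except the omitted reference level."""
    return [1 if value == level else 0 for level in levels if level != reference]

def covariates(university, field):
    """Nine covariates: six university dummies plus three major/field dummies."""
    return (dummy_code(university, UNIVERSITIES, "University of California at Berkeley")
            + dummy_code(field, FIELDS, "not ascertained"))

row = covariates("Auburn University", "social science")
ref = covariates("University of California at Berkeley", "not ascertained")
print(row)  # [1, 0, 0, 0, 0, 0, 0, 1, 0]
print(ref)  # [0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Dropping one level from each set avoids the linear dependency that arises when a full set of indicators sums to one for every examinee.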

Planned comparisons of simple effects of the experimental vs. control group factor for each ethnic group (e.g., Black students in the experimental group vs. Black students in the control group) and each sex (e.g., women in the experimental group vs. women in the control group) were also conducted (Howell, 1997).

Note that the analyses of covariance (and comparisons of simple effects) use unweighted means. Both statistical and practical significance were considered in evaluating the results. For statistical significance, an .05 alpha level was used in all analyses (including the planned comparisons of simple effects; Keppel, 1991). For practical significance, an η of .10 and an r of .10 were employed (Cohen’s, 1988, definition of a “small” effect size, accounting for 1% of the variance).
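The practical-significance threshold can be checked with simple arithmetic: squaring a correlation-type index gives the proportion of variance accounted for. The sketch below also includes the usual conversion from a t statistic to r; the t and degrees of freedom in the example are hypothetical values, not results from this study.

```python
import math

def variance_explained(index):
    """Proportion of variance accounted for by an eta or r of the given size."""
    return index ** 2

def r_from_t(t, df):
    """Correlation-type effect size implied by a t statistic with df error degrees of freedom."""
    return math.sqrt(t * t / (t * t + df))

print(round(variance_explained(0.10), 4))  # 0.01, i.e., 1% of the variance
print(round(r_from_t(2.0, 100), 3))        # 0.196 (hypothetical t and df)
```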

Results and Discussion

Experimental Manipulation

The adjusted mean Test Difficulty Rating was 5.76 for the experimental group and 6.62 for the control group (t = 4.38, p < .01, η = .23). In short, examinees in the experimental group judged the test to be appreciably easier than those in the control group.

Reliability

The reliability of the questionnaire and Word Fragments scales for the experimental and control groups is reported in Table 1. The reliability was similar for both groups. The reliability was moderate (about .7 or more) for Stereotype Threat, the four state self-esteem scales, State Anxiety, and two of the test anxiety scales (Tension and Test-Irrelevant Thinking). The reliability was marginal (about .4 to .6) for two of the test anxiety scales (Worry and Bodily Symptoms), Self-checking, and Effort, and minimal (.2 or less) for the three Word Fragment scales.

Insert Table 1 here


Intercorrelations

General Test Variables. The intercorrelations of the General Test scores, Mean Time for Verbal Items, and Mean Time for Quantitative Items for the experimental and control groups are reported in Table 2. The intercorrelations were similar for both groups. The Verbal and Quantitative scores correlated highly (.5 to .7), as did Mean Time for Verbal Items and Mean Time for Quantitative Items. The Verbal score and Mean Time for Verbal Items correlated moderately or minimally (.0 to .3), as did the Quantitative score and Mean Time for Quantitative Items.

Stereotype threat measures. The intercorrelations of the stereotype threat measures for the experimental and control groups are reported in Table 3. The intercorrelations were similar for both groups. Black Stereotype Activation, Women Stereotype Activation, and Stereotype Threat correlated minimally (.0 to .2) with each other.

Insert Tables 2 and 3 here

Analyses of Covariance

The analyses of covariance of the 20 dependent variables (the General Test scores and related variables, and the other measures) are summarized in Tables 4 and 5; the corresponding adjusted means for the ethnic and sex subgroups in the experimental and control groups are shown in Tables 6 and 7.

Main effects and interactions between ethnicity and sex. Ten of the 20 main effects for experimental vs. control group were statistically significant (p < .05), and all ten were also practically significant (η > .10). The experimental group’s means were lower than the control group’s for Mean Time for Verbal Items, Mean Time for Quantitative Items, State Anxiety, Worry, Tension, and Total Test Anxiety, and higher than the control group’s for Performance State Self-Esteem, Social State Self-Esteem, Total State Self-Esteem, and Effort.

Nine of the main effects for ethnicity were statistically significant, and all nine were also practically significant. The Black students’ means were lower than the White students’ for Verbal Score and Quantitative Score, and higher than the White students’ for Mean Time for Verbal Items, Mean Time for Quantitative Items, Black Stereotype Activation, Stereotype Threat, Test-Irrelevant Thinking, Total Test Anxiety, and Self-Doubt Activation.

Nine of the main effects for sex were statistically significant, and all nine were also practically significant. The women’s means were lower than the men’s for Verbal Score, Quantitative Score, Black Stereotype Activation, Performance State Self-Esteem, Appearance State Self-Esteem, and Total State Self-Esteem, and higher than the men’s for Stereotype Threat, State Anxiety, and Worry.

One of the interactions between ethnicity and sex, for Stereotype Threat, was statistically significant; it was also practically significant. This interaction, shown in Figure 1, indicates higher threat for White women than White men, and no difference for Black men and Black women.

Interactions between experimental vs. control group and ethnicity. Five of the 20 two-way interactions of experimental vs. control group with ethnicity were statistically significant, and all five were also practically significant. All involved dependent variables other than the Verbal and Quantitative scores and related measures: Social State Self-Esteem, Total State Self-Esteem, State Anxiety, Bodily Symptoms, and Total Test Anxiety. These interactions are portrayed in Figures 2 to 6. Four of the interactions indicate higher self-esteem and lower state anxiety and test anxiety for White students in the experimental group than those in the control group and no difference for Black students in the two groups. The remaining interaction, for Bodily Symptoms, indicates lower symptoms for White students in the experimental group than those in the control group, and vice versa for Black students.

Thirteen of the 40 simple effects for ethnicity were statistically significant, and all 13 were also practically significant. None of them concerned the Verbal and Quantitative scores, but several related variables were involved. Some of the simple effects occurred for both White and Black students. Both kinds of students had lower Mean Time for Verbal Items and Mean Time for Quantitative Items in the experimental group. All other effects occurred only for White students. Some concerned the interactions between experimental vs. control group and ethnicity noted earlier: State Anxiety and Total Test Anxiety, both lower in the experimental group, and Social State Self-Esteem and Total State Self-Esteem, both higher in the experimental group. Additional effects involved Worry and Test-Irrelevant Thinking, both lower in the experimental group, and Performance State Self-Esteem and Effort, both higher in the experimental group.

Interactions between experimental vs. control group and sex. One of the 20 two-way interactions of experimental vs. control group with sex was statistically significant; it was also practically significant. This interaction, for Tension, shown in Figure 7, indicates lower tension for women in the experimental group than those in the control group and no difference for men in the two groups.

Eleven of the 40 simple effects for sex were statistically significant, and all 11 were also practically significant. Paralleling the simple effects for ethnicity, none of these effects involved the Verbal and Quantitative scores, but several concerned related variables. Both men and women had lower Mean Time for Verbal Items and Mean Time for Quantitative Items in the experimental group. The other simple effects generally involved women only. One concerned the interaction between experimental vs. control group and sex noted earlier: Tension was lower for women in the experimental group. Other effects involved State Anxiety, Worry, and Total Test Anxiety, all lower for women in the experimental group, and Effort, higher for men in the experimental group.

Interactions between experimental vs. control group and ethnicity and sex. None of the 20 three-way interactions of experimental vs. control group with ethnicity and sex was statistically significant.


In summary, the experimental manipulation of test difficulty did not have any effect on General Test Scores or stereotype threat measures for Black students or women, much less White students and men. However, other variables that may be symptoms or consequences of stereotype threat were affected by the manipulation. Differential effects for White and Black students occurred on several clusters of variables. White students taking the easier test reported less state anxiety and test anxiety and more state self-esteem and effort than those taking the standard test, whereas Black students taking the two tests did not differ on these variables. Differential effects for men and women also occurred. Women taking the easier test reported less state anxiety and test anxiety than those taking the standard test, whereas men taking the two tests did not differ on these variables. And men taking the easier test reported more effort than those taking the standard test, whereas women taking the two tests did not differ on this variable.

Insert Tables 4 to 7 here

Insert Figures 1 to 7 here

Conclusions

Altering the difficulty of the General Test produced a variety of effects on the dependent variables, but differential effects for White and Black students and for men and women were limited. These effects did not include test performance or explicit indexes of stereotype threat, but only possible symptoms or consequences of stereotype threat, and most of the effects involved only White students and women.

It was hypothesized, on the basis of the Spencer et al. (1997) findings for women, and the likely implications of these results for Black examinees, that giving both groups easier tests would enhance their test performance and reduce their stereotype threat and its symptoms. The only support for this hypothesis comes from the finding that women’s state anxiety and test anxiety were decreased for the easier tests, while men’s anxiety was unaffected. Women’s test performance, explicit measures of stereotype threat, and other measures of possible symptoms or consequences of stereotype threat were not differentially affected. And none of the measures were differentially affected for Black students, contrary to what was hypothesized.

The reasons are unclear for the limited ability to replicate the Spencer et al. findings for women or to extend these results to Black examinees. Several issues are worth addressing.

First, the students in this study, like those in the Spencer et al. and Steele and Aronson (1995) research, were able and involved in academics, but, unlike the Spencer et al. subjects, the men and women and White and Black students in this study were not selected to be equal in ability, as reflected in admissions test scores and mathematics coursework in college. However, ability was indirectly controlled in the present study by covarying on major/field and university. Hence, the interactions between ethnic group or sex and experimental vs. control group should be at least roughly, if not precisely, comparable to those in the Spencer et al. research.


Furthermore, the focus of the analysis in the present study was on comparisons of each ethnic or sex subgroup in the experimental and control conditions, not comparisons of one subgroup (e.g., men) with another subgroup (e.g., women), making control for pre-existing differences unnecessary, given the random assignment of subjects to conditions. Hence, these contrasts are directly comparable to the within-sex comparisons in the Spencer et al. research.2

Second, although it is clear that the manipulation of test difficulty was successful overall, judging from the substantial mean difference in global ratings of difficulty by students in the experimental and control groups and the patterns of main effects for experimental vs. control groups (ten of the 20 main effects were statistically and practically significant), direct evidence is lacking of the effectiveness of the separate manipulations of difficulty for the Verbal and Quantitative sections. The simulation data obtained in connection with the modification of the General Test for this study indicated that the difficulty of the items in both sections was successfully manipulated over most of the ability range. Some circumstantial evidence for effectiveness comes from the main effects for experimental vs. control groups for Mean Time for Verbal Items and Mean Time for Quantitative Items, both of which had significantly lower means for the experimental group.

2 Although the emphasis in this study was on contrasts within ethnic groups and within sexes, it is worth noting that contrasts within the experimental and control groups for Verbal and Quantitative scores and Performance State Self-Esteem, corresponding to the comparisons done by Spencer et al. for quantitative test scores and the same self-esteem measure, also diverged from their findings of lower test performance and lower reported self-esteem for women than men on the hard test but no differences for women and men on the easy test.

In the ethnicity analyses, the mean differences for White and Black examinees for the Verbal and Quantitative scores were statistically and practically significant in both the experimental and control groups, with Black examinees having consistently lower means. For the Verbal score, F = 40.00, p < .01, η = .27 for the experimental group, and F = 41.71, p < .01, η = .28 for the control group. For the Quantitative score, F = 36.21, p < .01, η = .27 for the experimental group, and F = 26.30, p < .01, η = .23 for the control group. The mean difference for Performance State Self-Esteem was not significant for either the experimental or control group, F = 3.56, p > .05, η = .10 for the experimental group, and F = .27, p > .05, η = .03 for the control group.

In the sex analyses, the mean differences for men and women were not significant for the Verbal score for either the experimental or control groups, F = 3.22, p > .05, η = .08 for the experimental group, and F = 4.71, p < .05, η = .09 for the control group. But the mean differences were significant for the Quantitative score, for both the experimental and control groups, with women consistently having lower means, F = 5.97, p < .01, η = .11 for the experimental group, and F = 4.98, p < .05, η = .10 for the control group. The mean difference for Performance State Self-Esteem was significant for both the experimental and control groups, with women having consistently lower means, F = 6.19, p < .01, η = .13 for the experimental group, and F = 8.85, p < .01, η = .16 for the control group.


Third, two of the three explicit indexes of stereotype threat, the Black Stereotype Activation and Women Stereotype Activation word fragment measures, had severe psychometric limitations that hindered their ability to detect the experimental effects for Black students and women. Both measures were not only very unreliable but also correlated minimally with the corresponding stereotype threat questionnaire measure. However, the means for Black Stereotype Activation were significantly higher for Black than White students, though the means for Women Stereotype Activation were not significantly different for women and men. And Stereotype Threat, despite its substantial reliability, had a similar pattern of group differences as the word fragment measures: significantly higher means for Black than White students but no significant differences in means for women and men.

Fourth, is it possible that the level of stereotype threat associated with the General Test, a test that is well known to most or all of the students, who are either bound for graduate school or already in it, is so high or so intractable that manipulating difficulty, as was done in this study, cannot appreciably reduce the threat? For what it is worth, given the limitations of the stereotype threat measures, the level of threat appeared to be modest in this study. For example, for the total sample of Black students, the mean score was 1.16 out of a possible 11 on Black Stereotype Activation and 4.71 out of 10 on Stereotype Threat; for the total sample of women, the mean score was 4.16 out of 13 on Women Stereotype Activation and 3.16 out of 10 on Stereotype Threat. Verbal and quantitative items from the General Test were successfully used in eliciting stereotype threat in the Spencer et al. and Steele and Aronson research, but the source of the items was not identified to the undergraduate students, and they may not have been knowledgeable about the test.

Fifth, the present study had ample statistical power to detect the experimental effects. The sample size was 343 (the Spencer et al. sample size was 56) and the power was about .99 for all effects (main effects, first-order interactions, and second-order interactions), using the .05 alpha level and a “medium” (η = .24) effect size (Cohen, 1988).
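The power figure can be roughly checked with the sketch below. It is an approximation under stated assumptions, not the authors’ calculation: it treats each 1-df effect as a two-sided normal test, converts η to Cohen’s f² = η²/(1 − η²), takes the noncentrality as f² × N, and ignores the degrees of freedom used by the covariates. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(eta, n_total, alpha=0.05):
    """Normal approximation to the power of a 1-df F test (assumption, see lead-in)."""
    f_squared = eta ** 2 / (1 - eta ** 2)      # Cohen's f squared
    shift = math.sqrt(f_squared * n_total)     # square root of the noncentrality
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value in either tail.
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


power = approx_power(0.24, 343)
```

With these inputs the approximation lands above .99, in line with the value reported in the text; an exact calculation would use the noncentral F distribution.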

In sum, this study was largely unable to reduce stereotype threat, its symptoms, or its effects on performance on the General Test by altering test difficulty, apart from decreases in women’s anxiety. It should be borne in mind that this was a laboratory experiment with a General Test analog, like the Spencer et al. research that stimulated it, not a field study with an operational version of this test. Although the level of stereotype threat in this study appeared to be moderate, and the threat should be heightened for an operational test, it is uncertain whether changing test difficulty would be more effective or less effective in that situation. Other research on stereotype threat with operational tests makes it clear that laboratory findings may diverge markedly from the results of field studies (Stricker & Ward, 1999). Given the potential relevance of stereotype threat in explaining deficits in test performance of minority groups, women, and other targets of negative stereotypes about their intellectual ability (Spencer et al., 1997; Steele, 1997; Steele & Aronson, 1995), the real-world consequences of this phenomenon and means of ameliorating it merit further attention.

Although the focus was on stereotype threat, this study has broader implications for test-taking motivation, in general, and the General Test, in particular. First, as already noted, anxiety was reduced and self-esteem and effort were increased on the easier test for some groups of students, though test performance was unchanged. The same pattern occurred for the total sample. And the correlations of these variables with test performance were generally unchanged, implying that the meaning of the test scores was unaffected.3 Nonetheless, improving the morale of the test takers might be a desirable goal in and of itself, and this could be done by making the General Test easier, without noticeably degrading the psychometric characteristics of the test, judging from the simulations mentioned earlier.

Second, manipulating difficulty is only one of the changes that can be readily made in the General Test, given the flexibility of its computer-based format, in an effort to enhance test-taking motivation. One such change worth considering is “self-adapted testing” (see the review by Rocklin, 1994). This is a variant of computer-adaptive testing that allows the examinee to decide about the difficulty of each item that is administered to him or her. Several studies that compared self-adapted and conventional computer-adaptive tests found that self-adapted tests were less influenced by test anxiety, presumably because anxiety is elicited by items that the examinee perceives to be difficult. These findings suggest that a test’s construct validity may be enhanced by using a self-adapted format. This approach and others that capitalize on computer-based testing in the interest of equity in assessment merit further attention.

3 State Anxiety, Total Test Anxiety, Total Self-Esteem, and Effort correlated .06, .09, Z = .28, p > .05; -.24, -.14, Z = .95, p > .05; .13, .04, Z = .83, p > .05; and .15, .05, Z = .93, p > .05, with the Verbal score in the experimental and control groups, respectively. The four variables correlated .02, .09, Z = .64, p > .05; -.30, .00, Z = 2.83, p < .01; .22, .09, Z = 1.23, p > .05; and .17, -.01, Z = 1.67, p > .05, with the Quantitative score in the experimental and control groups, respectively.
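The Z statistics in this footnote are consistent with the standard test for the difference between two independent correlations based on Fisher’s r-to-z transformation, although the report does not name the procedure. The sketch below assumes that test; the function name is hypothetical, and the group sizes of 172 and 171 are an assumed near-even split of the 343 students, not figures from the report.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Two-tailed z test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)       # Fisher r-to-z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # standard error of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Total Test Anxiety with the Quantitative score: -.30 (experimental) vs. .00 (control).
z, p = compare_correlations(-0.30, 172, 0.00, 171)
```

With the assumed group sizes, the statistic for this pair falls close to the reported Z of 2.83 and is significant at the .01 level.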


References

Aronson, J., Quinn, D. M., & Spencer, S. J. (1998). Stereotype threat and the academic underperformance of minorities and women. In J. K. Swim & C. Stangor (Eds.), Prejudice--The target’s perspective. San Diego: Academic Press.

Benson, J., & El-Zahhar, N. (1994). Further refinement and validation of the Revised Test Anxiety Scale. Structural Equation Modeling, 3, 203-221.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Croizet, J.-C., & Claire, T. (1998). Extending the concept of stereotype threat to social class: The intellectual underperformance of students from low socioeconomic backgrounds. Personality and Social Psychology Bulletin, 24, 588-594.

Deaux, K., & Lewis, L. L. (1983). Assessment of gender stereotypes: Methodology and components. Psychological Documents, 13, 25.

DeVine, P. G., & Elliott, A. J. (1995). Are racial stereotypes really fading? The Princeton trilogy revisited. Personality and Social Psychology Bulletin, 21, 1139-1150.

Dorans, N. J. (1990). Scaling and equating. In H. Wainer (Ed.), Computerized adaptive testing: A primer. Hillsdale, NJ: Erlbaum.

Educational Testing Service (1995). POWERPREP: Preparing for the GRE General Test [computer software]. Princeton, NJ: Author.

Educational Testing Service (1995). New directions in assessment for higher education: Fairness, access, multiculturalism, & equity (FAME) (GRE, FAME Report Series, Vol. 1). Princeton, NJ: Author.

Ekstrom, R. B., French, J. W., & Harman, H. H. (1976). Manual for the Kit of Factor-Referenced Cognitive Tests, 1976. Princeton, NJ: Educational Testing Service.

Gilbert, D. T., & Hilton, J. G. (1991). The trouble of thinking: Activation and application of stereotypic beliefs. Journal of Personality and Social Psychology, 60, 509-517.

Graf, P., & Williams, D. (1987). Completion norms for 40 three-letter word stems. Behavior Research Methods, Instruments, & Computers, 19, 422-445.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council of Education and Macmillan.


Heatherton, T. F., & Polivy, J. (1991). Development and validation of a scale for measuring state self-esteem. Journal of Personality and Social Psychology, 60, 895-910.

Howell, D. C. (1997). Statistical methods for psychology (4th ed.). Belmont, CA: Wadsworth.

Keppel, G. (1991). Design and analysis--A researcher’s handbook (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.

O’Neil, H. F., Jr., & Abedi, J. (1996). Reliability and validity of a state metacognitive inventory: Potential for alternative assessment. Journal of Educational Research, 89, 234-245.

O’Neil, H. F., Jr., Sugrue, B., & Baker, E. L. (1995/1996). Effects of motivational interventions on the National Assessment of Educational Progress mathematics performance. Educational Assessment, 3, 135-157.

Orlofsky, J. L. (1981). Relationship between sex role attitudes and personality traits and the Sex Role Behavior Scale--1: A new measure of masculine and feminine role behaviors and interests. Journal of Personality and Social Psychology, 40, 927-940.

Overall, J. E., & Spiegel, D. K. (1969). Concerning least squares analysis of experimental data. Psychological Bulletin, 72, 311-322.

Rocklin, T. (1994). Self-adapted testing. Applied Measurement in Education, 1, 3-14.

Sarason, I. G. (1984). Stress, anxiety, and cognitive interference: Reactions to tests. Journal of Personality and Social Psychology, 46, 929-938.

Shih, M., Pittinsky, T. L., & Ambady, N. (1999). Stereotype susceptibility: Identity salience and shifts in quantitative performance. Psychological Science, 10, 81-84.

Spencer, S. J., Steele, C. M., & Quinn, D. (1997). Stereotype threat and women’s math performance. Manuscript submitted for publication.

Spielberger, C. D. (1980). Preliminary professional manual for the Test Anxiety Inventory. Palo Alto, CA: Consulting Psychologists Press.

Spielberger, C. D. (1983). Manual for the State-Trait Anxiety Inventory. Palo Alto, CA: Consulting Psychologists Press.

Steele, C. M. (1997). A threat in the air--How stereotypes shape intellectual identity and performance. American Psychologist, 52, 613-629.


Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797-811.


Stricker, L. J., & Ward, W. C. (1999). Stereotype threat, inquiring about examinees’ ethnicity and sex, and standardized test performance. Manuscript submitted for publication.

Thissen, D., & Mislevy, R. J. (1990). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer. Hillsdale, NJ: Erlbaum.


Table 1

Reliability of Measures in Experimental and Control Groups

                                        Reliability
Scale                             Experimental   Control
Black Stereotype Activation           .15          .22
Women Stereotype Activation           .08          .20
Stereotype Threat                     .80          .78
Performance State Self-Esteem         .75          .78
Social State Self-Esteem              .73          .65
Appearance State Self-Esteem          .68          .72
Total State Self-Esteem               .84          .85
State Anxiety                         .88          .88
Self-Checking                         .47          .54
Effort                                .49          .55
Worry                                 .50          .68
Tension                               .77          .77
Test-Irrelevant Thinking              .78          .78
Bodily Symptoms                       .45          .29
Total Test Anxiety                    .79          .80
Self-Doubt Activation                 .15          .13

Note. Reliability is estimated by Coefficient Alpha.


Table 2

Intercorrelations of General Test Scores and Related Variables

Variable                                 (1)    (2)    (3)    (4)
1. Verbal Score                           --    .56   -.29   -.15
2. Quantitative Score                    .71     --   -.05   -.05
3. Mean Time for Verbal Items           -.08   -.07     --    .50
4. Mean Time for Quantitative Items      .08    .08    .64     --

Note. Correlations for the control group appear above the diagonal; correlations for the experimental group appear below it. Correlations of .15 and .19 are significant at the .05 and .01 levels (two-tail), respectively, for the control group; the corresponding values are .15 and .20 for the experimental group.


Table 3

Intercorrelations of Stereotype Threat Measures

Scale                              (1)    (2)    (3)
1. Black Stereotype Activation      --    .14    .06
2. Women Stereotype Activation     .09     --    .00
3. Stereotype Threat               .14    .02     --

Note. Correlations for the control group appear above the diagonal; correlations for the experimental group appear below it. Correlations of .15 and .19 are significant at the .05 and .01 levels (two-tail), respectively, for the control group; the corresponding values are .15 and .20 for the experimental group.

Table 4

Summary of Analyses of Covariance of General Test Scores, Related Variables, and Stereotype Threat Measures

                                                                Mean Time    Mean Time for   Black        Women
                                       Verbal      Quantitative for Verbal   Quantitative    Stereotype   Stereotype   Stereotype
Source                        df       Score       Score        Items        Items           Activation   Activation   Threat
Experimental-Control (E-C)    1          .47          .01       19.26**a     19.75**a          .01          .00          .02
Ethnicity                     1        77.62**a     59.13**a    11.34**a     18.44**a         6.97**a       .16       312.33**a
E-C x Ethnicity               1          .00          .54         .45          .19             .01          .82         1.35
  White                       1          .24          .20        6.88**a     11.86**a          .00          .38          .85
  Black                       1          .23          .35       12.66**a      7.95**a          .02          .44          .53
Sex                           1         7.85**a     10.96**a      .21          .93            4.39*a       1.57         3.89*a
E-C x Sex                     1          .06          .03         .37          .02             .02          .48          .06
  Male                        1          .09          .00        7.68**a      8.59**a          .00          .25          .01
  Female                      1          .47          .04       11.56**a     11.28**a          .03          .24          .08
Ethnicity x Sex               1          .05          .98        1.72         1.27             .37          .23         9.71**a
E-C x Ethnicity x Sex         1         1.92         1.23        1.26          .43             .44          .01          .01
Error                      325/326   (8637.47)   (13344.76)    (89.93)     (268.61)          (1.00)       (2.63)       (2.89)

Note. Values enclosed in parentheses represent Mean Square Errors. The df is 325 for the analyses of the Black Stereotype Activation and Women Stereotype Activation Scales and 326 for the analysis of the Stereotype Threat Scale.
*p < .05; **p < .01; a η > .10.

Table 5

Summary of Analyses of Covariance of Other Measures

                                     Performance   Social       Appearance   Total
                                     State         State        State        State         State      Self-
Source                        df     Self-Esteem   Self-Esteem  Self-Esteem  Self-Esteem   Anxiety    Checking   Effort
Experimental-Control (E-C)    1       10.12**a      4.22*a        .31         6.17**a      6.57**a      .95      9.40**a
Ethnicity                     1         .93         2.86         1.37          .58          .04         .21      1.12
E-C x Ethnicity               1        3.09         6.72**a      2.38         6.11**a      3.75*a       .11      1.33
  White                       1       12.21**a     10.84**a      2.22        12.33**a     10.15**a      .20      8.90**a
  Black                       1         .98          .15          .49          .00          .19         .85      1.79
Sex                           1       14.91**a      2.55        14.48**a     13.87**a      4.18*a      1.69       .23
E-C x Sex                     1         .10          .41          .23          .00          .18         .03      2.78
  Male                        1        3.78*a       3.37          .00         2.82         2.12         .31     10.40**a
  Female                      1        6.60**a      1.07          .58         3.37         4.78*a       .69      1.04
Ethnicity x Sex               1         .22          .00          .02          .06          .37        2.88       .19
E-C x Ethnicity x Sex         1        2.03          .09          .01          .20          .44         .05       .87
Error                        326      (3.09)       (3.47)       (2.14)      (16.97)      (31.89)      (1.40)    (1.14)

Table 5 (continued)

                                                           Test-                     Total
                                                           Irrelevant   Bodily       Test        Self-Doubt
Source                        df     Worry      Tension    Thinking     Symptoms     Anxiety     Activation
Experimental-Control (E-C)    1     12.22**a    8.87**a     2.31          .01        9.98**a      1.67
Ethnicity                     1       .55       2.98        5.73*a       1.70        5.48*a       7.41**a
E-C x Ethnicity               1      2.89       3.62        2.96         5.09*a      7.26**a       .26
  White                       1     13.52**a   11.94**a     5.27*a       2.78       17.18**a      1.63
  Black                       1      1.57        .56         .02         2.36         .10          .30
Sex                           1     11.68**a    3.37        1.72          .29        2.36          .07
E-C x Sex                     1      1.53       3.83*a       .80          .04         .67          .00
  Male                        1      2.35        .47        2.71          .04        2.52          .82
  Female                      1     12.07**a   13.14**a      .21          .01        8.52**a       .84
Ethnicity x Sex               1       .57        .88         .01         1.08         .75          .31
E-C x Ethnicity x Sex         1      2.65       1.83         .04         1.07        1.69         2.34
Error                        326    (1.42)     (1.99)      (2.15)       (.68)      (11.41)       (.84)

Note. Values enclosed in parentheses represent Mean Square Errors.
*p < .05; **p < .01; a η > .10.

Table 6

Adjusted Means of Variables for Ethnic Groups

                        Ethnicity
Variable      White: Exp  Con  Total      Black: Exp  Con  Total      S.D.a

Verbal Score 513.82

Quantitative Score

Mean Time for Verbal Items

Mean Time for Quantitative Items

Black Stereotype Activation

Women Stereotype Activation

Stereotype Threat

Performance State Self-Esteem

Social State Self-Esteem

Appearance State Self-Esteem

583.07

-3.90 -.06

-8.62 .lO

.86 .86

4.01

1.12

6.19

5.31

4.77

506.82

575.02

5 10.32 419.07

579.04

-1.98

-4.26

.86

471 .Ol

-.90

412.17

481.59

4.33

.30

1.15

7.46

1.17

4.16 4.09 4.09

1.36 1.24 4.81

3.93 4.01

4.62 4.71

5.25 5.72 5.66 5.39

4.37

4.43

4.84 4.42 4.53

4.60 4.72 4.87 4.80 1.46

415.62 92.94

476.30 115.52

1.71

3.88

1.16

9.48 4

16.39

1.00

1.62

1.70

5.52 1.76

4.48 1.86


Table 6 (continued)

                        Ethnicity
Variable      White: Exp  Con  Total      Black: Exp  Con  Total      S.D.a

Total State Self-Esteem

State Anxiety

Self-Checking

Effort

Worry

Tension

Test-Irrelevant Thinking

Bodily Symptoms

Total Test Anxiety

Self-Doubt Activation

16.27

4.29

2.57

14.04

6.63

2.65

3.32 2.83

1.14

SO

.97

.41

3.03

-86

1.82

1.25

1.49 1.23

.63

5.19

1.04

15.16

5.46

2.61

3.07

1.48

.88

.52

4.11

.95

14.79 14.80

5.19 5.51

2.46 2.63

3.05 2.83

1.47 1.70

1.08 1.24

1.66 1.62

.74 .55

4.94 5.11

1.20 1.28

14.80

5.35

2.55

2.94

1.58

1.16

1.64

.64

5.03

1.24

4.12

4.77

1.18

1.07

1.19

1.41

1.47

.83

3.38

.92

a Calculated from the Mean Square Error in the analyses of covariance.

Table 7

Adjusted Means of Variables for Men and Women, and Experimental and Control Groups

                        Sex                                            Group
Variable      Men: Exp  Con  Total      Women: Exp  Con  Total      Exp  Con

Verbal Score

Quantitative Score

Mean Time for Verbal Items

Mean Time for Quantitative Items

Black Stereotype Activation

Women Stereotype Activation

Stereotype Threat

Performance State Self-Esteem

Social State Self-Esteem

Appearance State Self-Esteem

479.44 475.02

549.08 548.15

477.23

-2.96

-4.90

1.13

4.00

2.21

548.61

-.37

2.79 -1.06

1.12 1.12 .88 .91

3.87 3.94

2.80 2.78 2.79

6.27 5.72 5.99

5.10

5.01

4.55

5.00

4.82 4.64

5.00

453.45

505.00

-1.85

-3.42

4.10

3.12

5.59

4.47

443.97 448.7 1

508.46 506.73

2.05

4.76

.lO

.67

.90

4.22

3.20

4.16 4.05 4.05

3.16

4.91

4.35

4.31

5.25

4.50

4.39

466.44

527.04

-2.40

-4.16

1.01

2.96

5.93

4.87

4.74

459.50

528.30

2.13 N

3.78

1.01

2.99

5.32

4.45

4.65


Table 7 (continued)

                        Sex                                            Group
Variable      Men: Exp  Con  Total      Women: Exp  Con  Total      Exp  Con

Total State Self-Esteem

State Anxiety

Self-Checking

Effort

Worry

Tension

Test-Irrelevant Thinking

Bodily Symptoms

Total Test Anxiety

Self-Doubt Activation

16.37 15.26

4.32

2.61

3.31

1.16

5.43

2.71

2.76

80

1.35

1.46

.96

1.73

55 .57

3.86 4.72

1.02 1.15

15.82 14.70

4.87 5.16

2.66 2.42

3.03 3.06

1.31 1.45

88 .78

1.54 1.28

.56 .61

4.29 4.12

1.09 1.05

13.57

6.71

2.57

2.90

2.06

1.54

1.40

.60

5.59

1.18

Note. Standard deviations are reported in Table 6.

14.14

5.94

2.49

2.98

1.76

1.16

1.33

.61

4.85

1.11

15.53 14.42

4.74

2.51

6.07

2.64

3.19

1.31

.79

1.31

.58

3.99

1.03

2.83

1.76

1.24

1.56

.59

5.15

1.16

Figure 1. Interaction of ethnicity with sex for Stereotype Threat. (Plot omitted; horizontal axis: Men, Women.)

Figure 2. Interaction of experimental vs. control group with ethnicity for Social State Self-Esteem. (Plot omitted; horizontal axis: Control Group, Experimental Group; lines: White, Black.)

Figure 3. Interaction of experimental vs. control group with ethnicity for Total State Self-Esteem. (Plot omitted; horizontal axis: Control Group, Experimental Group; lines: White, Black.)

Figure 4. Interaction of experimental vs. control group with ethnicity for State Anxiety. (Plot omitted; lines: White, Black.)

Figure 5. Interaction of experimental vs. control group with ethnicity for Bodily Symptoms. (Plot omitted.)

Figure 6. Interaction of experimental vs. control group with ethnicity for Total Test Anxiety. (Plot omitted; horizontal axis: Control Group, Experimental Group; lines: White, Black.)

Figure 7. Interaction of experimental vs. control group with sex for Tension. (Plot omitted; lines: Men, Women.)


Appendix

Mean Item Difficulty by Order of Administration for Standard and Easier Versions of General Test: Simulation Data for Examinees at Different Ability Levels


(Plots omitted. One panel per ability level, with Thetahat values 4.66, 2.61, 1.66, 1.3, .8, .33, -.13, -.59, -1.07, -1.63, -2.34, and -5.86. Axis ticks run from 4 to -4. Lines: verbal standard, verbal experimental.)

(Plots omitted. One panel per ability level, with Thetahat values 1.24, .81, .43, .05, -.35, and -.81. Axis ticks run from 4 to -4. Lines: quantitative standard, quantitative experimental.)