Running Head: EMPIRICAL KEYING OF SITUATIONAL JUDGMENT TESTS
Empirical Keying of Situational Judgment Tests:
Rationale and Some Examples
Kelley J. Krokos
American Institutes for Research
Ph: 202.403.5259 Fx: 202.403.5033 [email protected]

Adam W. Meade
North Carolina State University
Ph: 919.513.4857 Fx: 919.515.1716

April R. Cantwell
North Carolina State University
Ph: 919.515.2251 Fx: 919.515.1716

Samuel B. Pond, III
North Carolina State University
Ph: 919.515.2251 Fx: 919.515.1716 [email protected]

Mark A. Wilson
North Carolina State University
Ph: 919.515.2251 Fx: 919.515.1716
Press Paragraph
Recently there has been increased interest in the use of situational judgment tests (SJTs) for
employee selection and promotion. SJTs have respectable validity coefficients with performance
criteria, though these coefficients vary from study to study. We propose the use of empirical
keying to help maximize the utility of SJTs. Though others have used such methods, we
provide a much-needed theoretical rationale for these scoring procedures by illustrating the
distinctions among SJTs, cognitive ability tests, and biodata. Results indicate that some empirical
keying approaches are advantageous, relative to traditional subject matter expert SJT scoring, for
predicting a leadership criterion.
Abstract
There has been increased interest in the use of situational judgment tests (SJTs) for employee
selection and promotion. We provide a much-needed theoretical rationale for empirical keying
of SJTs. Empirical results indicate that some empirical keying approaches are advantageous
compared with traditional subject matter expert SJT scoring.
Empirical Keying of Situational Judgment Tests: Rationale and Some Examples
SJTs are becoming increasingly popular as selection, promotion, and developmental tools
(Clevenger, Pereira, Wiechmann, Schmitt, & Harvey, 2001; Hanson & Ramos, 1996; McDaniel,
Finnegan, Morgeson, Campion, & Braverman, 1997), and with good reason; several researchers
have had considerable success in predicting performance with SJTs (McDaniel, Morgeson,
Finnegan, Campion, & Braverman, 2001; Phillips, 1993; Weekley & Jones, 1999), often with less
adverse impact than is typically found in measures of cognitive ability (Hanson & Ramos, 1996;
Motowidlo & Tippins, 1993; Weekley & Jones, 1997).
Despite these promising findings, one persistent problem with the use of SJTs is that
validity coefficients often vary widely. For example, some authors have found no significant
correlation between SJT scores and employee performance (Smiderle, Perry, & Cronshaw,
1994), while others have found validity coefficients of .45 (Phillips, 1993) and .56 (Stevens &
Campion, 1999). Still others have found widely divergent results for men and women (Phillips,
1992) or by construct examined (Motowidlo, Dunnette, & Carter, 1990). Undoubtedly, there are
also many unpublished studies of SJTs showing varying or non-significant results.
One possible explanation for these problems may lie in the way SJTs are scored.
Traditionally, subject matter experts (SMEs) have determined the “correct” responses to SJT
items. However, we propose that an empirical approach to item scoring has several theoretical
and practical advantages over the SME approach. Though we are not the first to suggest the use
of empirical keys for SJTs, we do provide a much-needed theoretical rationale for their use that
has not previously been discussed. In this study, we discuss the advantages of empirical
approaches and illustrate their use with an SJT predicting a leadership criterion.
Scoring of SJTs. In a recent review, McDaniel and Nguyen (2001) describe approaches
to scoring SJTs. The first and most common approach is to ask subject matter experts (SMEs) to
decide which response alternative is best for each item. With this approach, items with little or
no SME agreement are deleted or rewritten. Results with the SME scoring approach vary, though
they are generally positive. A second scoring approach identified by McDaniel and Nguyen
(2001) involves pilot testing an SJT and identifying the “correct” responses based on central
tendency statistics, though no example or explanation of how this should be implemented was
given. The last approach discussed by McDaniel and Nguyen (2001) is the use of empirical
methods to determine the scoring key.
Although empirical scoring approaches are rarely used for SJTs, some research evidence
suggests that SJTs scored in this way can yield moderate validity coefficients. Dalessio (1994)
successfully used an empirical keying technique for an SJT to predict turnover among insurance
agents. Weekley and Jones (1997) used empirical scoring based on mean criterion performance
of service workers and found a cross-validity coefficient of .22. Finally, although the
relationship between SJT scores and performance criteria was not assessed, Lievens (2000)
developed an empirical scoring key for an SJT using correspondence analysis and discriminant
analysis.
In contrast to the paucity of studies examining empirical scoring procedures of SJTs,
biodata has a long history of using empirical scoring procedures. These procedures are easily
adaptable to SJT items. However, Hogan (1994) briefly reviewed the entire history of empirical
keying methods and found that few studies had compared different empirical keying procedures.
In one of the few studies comparing multiple empirical keying techniques, Devlin et al. (1992)
found that the vertical percent method (application blank method; England, 1961) was among the
best at predicting academic performance for college freshmen, with cross-validities typically in
the .4-.5 range. The horizontal percent (Stead & Shartle, 1940) and phi coefficient methods (cf.
the Lecznar & Dailey, 1950, correlational method) also proved useful in their study, with
validities only slightly lower than those of the vertical percent methods they investigated (Devlin
et al., 1992). The mean criterion method had greater variation in cross-validities across different
time spans, though the cross-validities were between .2 and .5. However, there was more
shrinkage in cross-validation for this method than for most others.
Rationale for Empirical Keying of SJTs
Though use of empirical keying for biodata has been criticized as “dust-bowl
empiricism” (Dunnette, 1962; Mumford & Owens, 1987; Owens, 1976), we believe that it may
actually be preferable to SME based scoring procedures for some SJTs. On the surface, SJTs
seem to be somewhat closely related to both cognitive ability tests and biodata. However, we
contend that SJTs are unique measurement methods (Hanson, Horgen, & Borman, 1998) and
thus have unique properties that make them particularly well suited for empirical keying.
First, we explicitly reject the notion of “correct” and “incorrect” answers for most SJTs.
The notion that SJTs should have correct and incorrect responses likely stems at least in part
from the relationships between SJTs and cognitive ability tests, which generally do have correct
and incorrect answers. Indeed, research suggests that SJT scores are highly related to scores on
tests of general cognitive ability (McDaniel et al., 1997, 2001). In addition, SJTs are used in
ways and contexts that are typical for the use of cognitive ability tests, such as personnel
selection. However, despite these relationships and contextual similarities, SJT items, unlike
typical academic or cognitive ability test items, are not designed to have a single irrefutable
correct answer. Instead, SJT items are typically designed to capture the more complex, social
or practical aspects of performance in work situations. McDaniel et al. (1997) suggest that SJTs
are indistinguishable from tests of tacit knowledge. To the extent that this is true, SJTs measure
something different than general cognitive or academic intelligence (Sternberg, Wagner,
Williams, & Horvath, 1995). To capture this type of knowledge, test items pose problems that
are not well defined and may have more than one correct response (Sternberg & Wagner, 1993;
Sternberg, Wagner, & Okagaki, 1993).
Finally, an examination of typical SJT items reveals that there is generally no clear right
or wrong answer. This is actually a desirable feature of SJTs as transparent items would quickly
lead to ceiling effects that would fail to discriminate between high and low performers. Note,
however, that this concern does not arise in many biodata scales, where items can be based on
external, objective, and verifiable previous life experiences (Mael, 1991).
We believe that all response options for an SJT item vary along a continuum of best to
worst. The exact location of an option on this continuum is difficult to determine and will vary
by item and perhaps also by the job for which the applicant is applying. Some items may be
written with one clearly best option, while others may be written with less distinct response
alternatives. Transparent items lead to ceiling effects while ambiguous items make it
exceedingly difficult for SMEs to achieve consensus about the appropriateness of each option.
When a scoring key is developed by SMEs, an SJT score
represents the extent to which each respondent agrees with the judgments of the SMEs. By
requiring a high degree of consensus among SMEs, researchers can increase the likelihood that
answers will not be too specific to the opinions of the particular group of SMEs. Unfortunately,
however, this procedure also increases the likelihood that correct answers will be the most
transparent options. Thus, more transparent items rather than less transparent ones are likely to
be retained when SMEs are employed to determine the keyed answer. In addition, the option
ultimately determined to be best by the SMEs will depend to some extent upon the unique
perspective of a particular SME group and the group dynamics involved in obtaining consensus.
Deciding between an SME-based key and an empirical key is really a question of who will
serve as the SMEs. When traditional SME scoring is used, an SJT score is an index of
agreement among respondents and SMEs. The extent to which these scores are construct valid is
dependent upon both the validity of the SMEs’ conceptualization of the construct and the validity
of the SMEs' assessment of the relationship between the response options and the construct. As
such, low validity coefficients for SME-scored SJTs could be due to differences in perceptions of
the construct between respondents (e.g., job applicants) and SMEs (e.g., a small group of
supervisors); poor SME judgment as to which response option is most indicative of the construct;
or overly transparent “best” answers chosen not only by SMEs, but also by both high and low
performing respondents.
In contrast, when empirical keying is used, the de-facto “SMEs” are the high performing
respondents as measured by the criterion of interest. More specifically, response options that
best differentiate between high and low performing incumbents are given more weight than other
options, even though those other options may in many ways seem to be better responses. Using
empirical keying, the most transparent option (and seemingly the best option) may be endorsed
by a majority of respondents; however if high and low performing respondents equally endorse
the response option, it will not differentiate between criterion groups and consequently will not,
in effect, be weighted. In contrast, a response option that is not endorsed frequently but is
endorsed much more often by high performing respondents than low performing respondents
will be weighted much more heavily with many of the empirical scoring methods. In general,
this will be desirable so long as the number of respondents endorsing the response option is not
so small as to risk severe shrinkage in cross-validation. We should point out, though, that if
a criterion does not fully capture the performance domain, it might be preferable to use SME
judgment to determine the correct answers to SJT items. In such cases, however, attention to
better criterion development would be a pressing concern.
While much of the criticism leveled at empirical keying methods of biodata scoring
concerns the lack of theory behind the choice of predictors (Mumford & Owens, 1987; Owens,
1976), such criticism is not necessarily relevant to SJTs. SJT items are typically written based
on job analysis data and are thus believed to be related to job relevant behaviors and criteria from
their inception. As a result, empirical keying of these items serves to merely define the optimal
relationship between those items and the criterion.
In sum, we believe that there may be some utility in investigating empirical keying as an
alternative for SJT scoring. In this research, we investigated the use of empirical keying as an
alternative to the traditional SME based scoring procedures for an SJT developed to select
students receiving a highly competitive four-year scholarship at a major university.
Method
Participants
Participants were 219 undergraduate students (scholars) from a large university who were
recipients of a highly competitive four-year academic scholarship. Roughly 55% were female
and 45% were male, while approximately 36% were freshmen, 26% sophomores, 22% juniors,
and 16% seniors. Note that while the sample is composed of students, this was not a lab study or
a sample of convenience; the students were the target population for the SJT.
Measures
Phase one of the project involved criterion development during which the appropriate
behaviors associated with four performance dimensions (Leadership, Scholarship, Service, and
Character) were identified. For example, the Leadership dimension included behaviors such as
knowing when to take a leadership versus a support role, being comfortable in ambiguous
situations, developing cooperative relationships, and handling conflict appropriately. Results for
the Leadership dimension are reported in this study in order to simplify the presentation of
results and because leadership is most readily generalizable to other organizational settings.
Phase two involved the development of the SJT item stems and response options. The
SJT item stems were developed by the program research team using the data gathered in the
criterion development phase. The response options were developed by a group of SMEs
including university faculty and the scholarship program directorate. The items and response
options subsequently underwent additional rigorous reviews and modifications by SMEs and the
research team. The final SJT was composed of three detailed scenarios that describe situations
that scholarship recipients may encounter. Each scenario comprised several multiple-choice
items. Respondents were instructed to indicate which of the five response options they
would most likely do and which they would least likely do.
Phase three involved developing the SME based scoring key. SMEs who were both
intimately familiar with the scholarship program and who had advanced training in assessment
methodology determined the most effective answers for each item. For the most part, only
response options with more than 70% agreement by the SMEs were retained as the correct
option. However, in some cases there was less agreement among SMEs, and in these cases
preferential weighting was given to a core group of SMEs (i.e., the program’s director and one
key faculty advisor). In this study, we analyzed responses to the most likely questions in order to
simplify analyses and presentation of results.
Performance Criteria. Performance rating content and materials were developed in
phase four based on the data gathered during the criterion development phase. Performance ratings
were made primarily by the scholarship program director. When clarification was needed, a
mentor or other program director was consulted for further information. Two dimensions of
leadership were rated independently: effectiveness of leadership skills and actively seeking a
leadership role. Initial analyses indicated that these two ratings correlated highly (r=.79, p<.01);
thus, the two were combined into a single index.
Procedure
SJT scores for the leadership dimension were calculated using the traditional SME
scoring approach and several empirical keying methods shown previously to be of some utility in
either SJT or biodata research. Because small calibration sample size is the biggest determinant of
shrinkage in cross-validation (Hough & Paullin, 1994), two-thirds of the total sample
was randomly assigned to the calibration sample while the remaining one-third was retained for
the cross-validation sample.
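As a concrete illustration of this setup, the sketch below shows one way the two-thirds/one-third random split might be implemented. This is a minimal sketch only: the variable names, the synthetic 0/1 option-indicator matrix, and the use of Python/NumPy are our own assumptions rather than details of the original study.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

# Hypothetical data: a 0/1 matrix of response-option endorsements
# (rows = respondents, columns = response options) and a vector of
# leadership criterion ratings.
n_respondents, n_options = 219, 60
option_matrix = rng.integers(0, 2, size=(n_respondents, n_options))
criterion = rng.normal(loc=4.5, scale=1.0, size=n_respondents)

# Randomly assign two-thirds of respondents to the calibration sample and
# retain the remaining one-third for cross-validation.
order = rng.permutation(n_respondents)
cut = int(np.ceil(2 * n_respondents / 3))
calib_idx, cross_idx = order[:cut], order[cut:]

X_calib, y_calib = option_matrix[calib_idx], criterion[calib_idx]
X_cross, y_cross = option_matrix[cross_idx], criterion[cross_idx]
```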
More specifically, six empirical keying techniques were investigated. Each technique produces a
numeric value, or weight, for each response option. These weights were combined with each
individual's score on each option (0 if the option was not selected, 1 if it was selected), and the
resulting weighted options were then entered into a regression equation predicting performance
on the leadership criterion. The empirical techniques employed are described below.
Vertical and horizontal percent methods. In order to compute weights via the vertical
and horizontal percent methods, the calibration sample was first divided into high and low
performance groups with respect to the criterion. The sample was split into thirds based on
criterion scores, and only the lowest and highest thirds of the sample were used for weighting.
Vertical percent weights were computed by taking the percentage of persons in the high group
choosing each option and subtracting the percentage of persons in the low performing group
choosing that option. Horizontal weights were computed by taking the number of persons in the
high performance group choosing each response option and dividing this number by the total
number of people in the sample choosing that option. We then multiplied this number by ten to
derive the final horizontal weights (see Devlin et al., 1992).
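A minimal sketch of these two weighting schemes is shown below, continuing from the earlier sketch's option-indicator matrix and criterion vector. The function name, the percentile-based tercile split, and the guard against unendorsed options are illustrative choices of ours, not prescriptions from the study.

```python
import numpy as np

def percent_weights(X, y):
    """Vertical and horizontal percent weights from a calibration sample.
    X is a 0/1 matrix of option endorsements; y holds criterion scores."""
    # Split the calibration sample into thirds on the criterion; only the
    # lowest and highest thirds are used for weighting.
    low_cut, high_cut = np.percentile(y, [100 / 3, 200 / 3])
    high, low = X[y >= high_cut], X[y <= low_cut]

    # Vertical percent: percentage of the high group choosing each option
    # minus the percentage of the low group choosing that option.
    vertical = 100 * (high.mean(axis=0) - low.mean(axis=0))

    # Horizontal percent: number of high-group endorsers of each option
    # divided by the total number of endorsers of that option, times ten.
    total_endorsers = np.maximum(X.sum(axis=0), 1)  # guard against unchosen options
    horizontal = 10 * high.sum(axis=0) / total_endorsers
    return vertical, horizontal

vertical_w, horizontal_w = percent_weights(X_calib, y_calib)
```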
Correlational Methods 1 and 2. The dichotomously scored response options were
correlated with the leadership performance criterion. The resulting zero-order correlation was
treated as the weight. For this study, we chose two alpha levels for retaining response options as
predictors. In Correlational Method 1, we used the α=.25 level, which corresponded to zero-order
correlations of roughly r=.10 in magnitude. For Correlational Method 2, we keyed only item
responses significant at the α=.10 level (roughly r=.14).
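A sketch of this correlational keying, again under the assumed data layout from the earlier sketches; the use of scipy.stats.pearsonr for the point-biserial correlation and the skipping of zero-variance options are our own implementation choices.

```python
import numpy as np
from scipy import stats

def correlational_weights(X, y, alpha=0.10):
    """Weight each option by its zero-order correlation with the criterion,
    keeping only options whose correlation is significant at alpha."""
    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        if X[:, j].std() == 0:       # option chosen by everyone or no one
            continue
        r, p = stats.pearsonr(X[:, j], y)
        if p < alpha:                # alpha = .25 (Method 1) or .10 (Method 2)
            weights[j] = r
    return weights

method1_w = correlational_weights(X_calib, y_calib, alpha=0.25)
method2_w = correlational_weights(X_calib, y_calib, alpha=0.10)
```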
Mean criterion method. In order to generate the empirical scoring key for the mean
criterion method, we computed mean criterion performance scores associated with each response
option. These mean scores were then used as the empirical weights in computing predictor
scores for persons choosing each response option.
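A sketch of the mean criterion keying under the same assumptions; leaving never-endorsed options at a weight of zero is a handling choice of ours rather than something the paper specifies.

```python
import numpy as np

def mean_criterion_weights(X, y):
    """The weight for each option is the mean criterion score of the
    calibration respondents who chose that option."""
    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        chosen = X[:, j] == 1
        if chosen.any():
            weights[j] = y[chosen].mean()
    return weights

mean_criterion_w = mean_criterion_weights(X_calib, y_calib)
```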
Unit weighting method. With the unit weighting method, response options associated
with the highest mean criterion scores were assigned a value of 1.0 while other responses were
assigned a value of 0. However, options that were associated with the highest mean criterion
were subject to the restriction that at least 10% of the sample must have chosen that option in
order to reduce the risk of significant results by chance alone (see Weekley & Jones, 1999).
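The sketch below reflects our reading of this procedure: within each item, the option with the highest mean criterion score receives a weight of 1.0, subject to the 10% endorsement restriction. The per-item grouping (the item_of_option vector) is our interpretation and is not spelled out in the original description; the mapping shown is purely hypothetical.

```python
import numpy as np

def unit_weights(X, y, item_of_option, min_endorsement=0.10):
    """Unit weighting: within each item, give a weight of 1.0 to the option
    with the highest mean criterion score, provided at least 10% of the
    sample endorsed it; all other options are weighted 0."""
    n, k = X.shape
    option_means = np.full(k, -np.inf)
    for j in range(k):
        chosen = X[:, j] == 1
        if chosen.mean() >= min_endorsement:     # 10% endorsement restriction
            option_means[j] = y[chosen].mean()

    weights = np.zeros(k)
    for item in np.unique(item_of_option):
        cols = np.where(item_of_option == item)[0]
        best = cols[np.argmax(option_means[cols])]
        if np.isfinite(option_means[best]):      # at least one eligible option
            weights[best] = 1.0
    return weights

# Hypothetical mapping of the 60 options to 12 items with 5 options each.
item_of_option = np.repeat(np.arange(12), 5)
unit_w = unit_weights(X_calib, y_calib, item_of_option)
```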
Each of the empirical keying techniques results in a numeric value for each response
option. This value was used as the beta weight in a regression equation that sought to predict
performance on the leadership criterion using the weighted response options as predictors.
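To make this final step concrete, the sketch below (continuing from the earlier sketches) applies a derived key by summing the weights of the options each respondent endorsed and correlating that composite with the criterion in both samples. The weighted-sum composite is a simplification of our own; it illustrates the general logic rather than the exact regression specification used in the study.

```python
import numpy as np

def score_and_validate(weights, X_calib, y_calib, X_cross, y_cross):
    """Score respondents as the weighted sum of their endorsed options and
    report the calibration validity and the cross-validity."""
    r_calib = np.corrcoef(X_calib @ weights, y_calib)[0, 1]
    r_cross = np.corrcoef(X_cross @ weights, y_cross)[0, 1]   # cross-validity
    return r_calib, r_cross

for name, w in [("Vertical/horizontal %", vertical_w),
                ("Correlational Method 2", method2_w),
                ("Mean criterion", mean_criterion_w),
                ("Unit weighting", unit_w)]:
    r_cal, r_crs = score_and_validate(w, X_calib, y_calib, X_cross, y_cross)
    print(f"{name}: calibration r = {r_cal:.2f}, cross-validation r = {r_crs:.2f}")
```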
Results
Descriptive statistics for SME and empirical keying methods of scoring as well as the
criterion measure of performance are presented in Table 1. Table 2 contains correlations
between the predictor scores and criteria ratings.
As can be seen in Table 2, the SME-based scoring of the leadership dimension left much
to be desired. The SME-based leadership scores had only a marginally significant relationship
with performance for the calibration sample. However, the SME-based predictor was not
significantly related to performance for the cross-validation sample. The results of the empirical
keying approaches were decidedly mixed. Though all empirical keying approaches had large
significant correlations with performance in the calibration sample, only the correlational method
was significantly related to performance in the cross-validation sample.
Discussion
In this study, we found that the predictive validity of an SJT could be improved by utilizing
several types of empirical keying procedures. In addition, we have detailed several theoretical
reasons why empirical keying may be preferable to SME scoring for some SJTs. However, we
also found that empirical keying is not a panacea for all that ails a predictor. Instead, we found
that many techniques shown to be predictive of performance in biodata contexts were of little use
for our SJT measure. Our study also illustrates the most pervasive problem of empirical scoring
procedures – a general lack of cross-validation. Validities in our sample shrank considerably
between calibration and cross-validation despite our best efforts to split the sample so that the
majority of the data was used to derive stable empirical keying weights.
Though previous authors have discussed some advantages of the correlational method
(Lecznar & Dailey, 1950; Weekley & Jones, 1999), we were somewhat surprised by the clearly
superior behavior of this type of empirical keying in our study. Perhaps this is because fewer
(but higher quality) predictors were used with the correlational method. The more selective of
the two correlational methods (Correlational Method 2) enjoyed considerably higher cross-
validities than did the less restrictive of the two. This is to be expected to some extent.
However, because item responses with a weaker relationship to the criterion also receive
correspondingly smaller weights, the magnitude of the difference was somewhat surprising.
Though the empirical keying approaches examined in this study often
fared no better than the SME approach, we stress some of the positive aspects of empirical
keying. First, empirical keying can serve as a validity check on SME ratings of the “correct”
response option. If the response option chosen by SMEs as the correct response does not
distinguish between high and low performers with respect to the criterion measure, then perhaps
that option is not correct after all. The unit weighting scoring procedure used in this study
exemplified this function. Researchers who reject the use of external keying procedures on
philosophical grounds may still derive benefit from their use as a validity check and as part of the
SJT development process. Requiring 75% agreement among SMEs to keep an item is a stringent
but common criterion (Legree, 1994; Lievens, 2000). Using the unit weighting approach in
combination with the SME approach may allow researchers to relax the agreement criterion
slightly, given the empirical information offered by the unit weighting approach.
Second, we believe that the empirical keying approaches significantly improve upon
the SME scoring approach because they counter a number of its weaknesses and introduce
important information into the scoring process. Empirical keying approaches inherently reject
the notion of right and wrong answers to SJT items. That is, (most) empirical keying approaches
award “partial credit” of sorts to a person choosing a response option that differentiates between
higher and lower performance on the criterion. For example, two options that relate highly with
the criterion would both be weighted strongly, rather than just one in traditional correct/incorrect
scoring. Conversely, negative weighting penalizes choices associated with poorer performance.
Also, practitioners not terribly comfortable with pure empirical keying could
consider using a hybrid approach in which only items written to measure specific competencies
are used to predict criteria deemed relevant. With this approach, practitioners can maintain a
theoretical link between competencies and criterion if only items written to measure those
competencies are used as predictors, rather than all SJT items. This approach is not purely
empirical but instead is more akin to the family of approaches in biodata research known as
construct-based rational scoring (Hough & Paullin, 1994). With this type of scoring, the exact
nature of the relationship between the items and the construct (i.e., the scoring) is determined
empirically though the theoretical link between predictor and criteria remains.
As with any research, there are some potential limitations associated with this study. One
limitation is the use of a student sample and an SJT designed for use in this sample. However,
note that the SJT, while designed for use with a student sample, was rigorously developed. In
addition, we attempted to choose the construct most relevant to organizations in our
investigation. Though it is somewhat unlikely that an organization would hire an employee
based on an SJT designed to measure leadership, it is entirely possible that such an SJT might be
used as one factor in promotion decisions or for personal/career development purposes. Also,
although a student sample was used in this study, these students are among the best in the
country, with remarkable standardized test scores, clear leadership in extracurricular activities,
and great promise, and indeed expectations, for future leadership positions.
A further limitation of the study was the relatively small sample used to derive the
external weights. When deriving weights that reflect the relationship between the item response
options and the criterion for the population as a whole, the larger the sample used to derive these
weights, the better (Hough & Paullin, 1994). Also, the SJT contained a relatively small number of
items. This was a function of both pragmatic concerns over test length and other design
considerations outside the control of the researchers. In general, large sample sizes and a large
number of items will lead to the best and most stable prediction of performance.
Another limitation was the low initial validity coefficient associated with the SME-based
approach. This low initial coefficient does not set a very high bar for the empirical keying
approaches to clear.
Despite these limitations, we feel that this study provides promising results. By
combining well-developed, content-valid items with externally derived empirical scoring for
the item response options, an optimal balance can be struck for scoring an SJT.
References
Chan, D. & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in
situational judgment tests: Subgroup differences in test performance and face validity
perceptions. Journal of Applied Psychology, 82, 143-159.
Clevenger, J., Pereira, G. M., Wiechmann, D., Schmitt, N., & Harvey, V. S. (2001). Incremental
validity of situational judgment tests. Journal of Applied Psychology, 86, 410-417.
Dalessio, A. T. (1994). Predicting insurance agent turnover using a video-based situational
judgment test. Journal of Business and Psychology, 9, 23-32.
Devlin, S. E., Abrahams, N. M., & Edwards, J. E. (1992). Empirical keying of biographical data:
Cross-validity as a function of scaling procedure and sample size. Military Psychology, 4,
119-136.
Dunnette, M. D. (1962). Personnel management. Annual Review of Psychology, 13, 285-314.
England, G. W. (1961). Development and Use of Weighted Application Blanks. Dubuque, IA:
Brown.
Hanson, M. A., Horgen, K. E., & Borman, W. C. (1998). Situational judgment tests as measures
of knowledge/expertise. Paper presented at the annual meeting of the Society for Industrial
and Organizational Psychology, Dallas, TX.
Hanson, M. A., & Ramos, R. A. (1996). Situational judgment tests. In R. S. Barrett (Ed.), Fair
employment strategies in human resource management (pp. 119-124). Westport, CT:
Quorum Books/Greenwood Publishing Group.
Hogan, J. B. (1994). Empirical keying of background data measures. In G. S. Stokes & M. D.
Mumford (Eds.), Biodata handbook: Theory, research, and use of biographical
information in selection and performance prediction (pp. 69-107). Palo Alto, CA: CPP
Books.
Hough, L., & Paullin, C. (1994). Construct-oriented scale construction: The rational approach. In
G. S. Stokes & M. D. Mumford (Eds.), Biodata handbook: Theory, research, and use of
biographical information in selection and performance prediction (pp. 109-145). Palo
Alto, CA: CPP Books.
Lecznar, W. B., & Dailey, J. T. (1950). Keying biographical inventories in classification test
batteries. American Psychologist, 5, 279.
Legree, P. J. (1994). The effect of response format on reliability estimates for tacit knowledge
scales (No. ARI Research Note 94-25). Alexandria, VA: U.S. Army Research Institute
for the Behavioral and Social Sciences.
Lievens, F. (2000). Development of an empirical scoring scheme for situational inventories.
European Review of Applied Psychology/Revue Europeenne de Psychologie Appliquee,
50, 117-125.
Mael, F. A. (1991). A conceptual rationale for the domain and attributes of biodata items.
Personnel Psychology, 44, 763-792.
McDaniel, M. A., Finnegan, E. B., Morgeson, F. P., Campion, M. A., & Braverman, E. P.
(1997). Predicting job performance from common sense. Paper presented at the 12th annual
meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.
McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P.
(2001). Use of situational judgment tests to predict job performance: A clarification of
the literature. Journal of Applied Psychology, 86, 730-740.
McDaniel, M. A., & Nguyen, N. T. (2001). Situational judgment tests: A review of practice and
constructs assessed. International Journal of Selection and Assessment, 9, 103-113.
Mead, A. D., & Drasgow, F. (2003). Examination of a resampling procedure for empirical
keying. Paper presented at the 18th Annual Meeting of the Society for Industrial and
Organizational Psychology, Orlando, FL.
Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure:
The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.
Motowidlo, S. J., & Tippins, N. (1993). Further studies of the low-fidelity simulation in the form
of a situational inventory. Journal of Occupational and Organizational Psychology, 66,
337-344.
Mumford, M. D., & Owens, W. A. (1987). Methodology review: Principles, procedures, and
findings in the application of background data measures. Applied Psychological
Measurement, 11, 1-31.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). New York: McGraw-
Hill, Inc.
Owens, W. A. (1976). Background data. In M. D. Dunnette (Ed.), Handbook of Industrial and
Organizational Psychology (1st ed., pp. 609-644). Chicago: Rand McNally.
Phillips, J. F. (1992). Predicting sales skills. Journal of Business and Psychology, 7, 151-160.
Phillips, J. F. (1993). Predicting negotiation skills. Journal of Business and Psychology, 7, 403-
411.
Russell, C. J., & Klein, S. R. (2003). Toward optimization and insight: Bootstrapping a
situational judgment empirical key. Paper presented at the 18th Annual Meeting of the
Society for Industrial and Organizational Psychology, Orlando, FL.
Smiderle, D., Perry, B. A., & Cronshaw, S. F. (1994). Evaluation of video-based assessment in
transit operator selection. Journal of Business and Psychology, 9, 3-22.
Stead, N. H., & Shartle, C. L. (1940). Occupational counseling techniques. New York: American
Book.
Sternberg, R. J., & Wagner, R. K. (1993). The g-ocentric view of intelligence and job
performance is wrong. Current Directions in Psychological Science, 2, 1-5.
Sternberg, R. J., Wagner, R. K., & Okagaki, L. (1993). Practical intelligence: The nature and role
of tacit knowledge in work and at school. In J. M. Puckett (Ed.), Mechanisms of everyday
cognition (pp. 205-227). Hillsdale, NJ: Lawrence Erlbaum Associates.
Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common
sense. American Psychologist, 50, 912-927.
Stevens, M. J., & Campion, M. A. (1999). Staffing work teams: Development and validation of a
selection test for teamwork settings. Journal of Management, 25, 207-228.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50,
25-49.
Weekley, J. A., & Jones, C. (1999). Further studies of situational tests. Personnel Psychology,
52, 679-700.
Table 1
Descriptive Statistics for Predictors and Leadership Criteria Performance Ratings
                                   Calibration Sample      Cross-Validation Sample
                                        (N=144)                    (N=75)
Variable                            Mean     Std. Dev.       Mean     Std. Dev.
SME Method                          4.28       1.39          3.99       1.24
Correlational Method 1              0.20       0.37          0.14       0.30
Correlational Method 2              0.26       0.27          0.22       0.22
Vertical %                        -42.61      58.55         53.79      36.67
Horizontal %                       41.06       4.87        128.33       6.65
Mean Criterion                     92.72       1.03         93.31       1.46
Unit Weighting                     16.49       4.30         16.03       4.66
Leadership Performance Rating       4.42       0.93          4.58       0.98

Note: Correlational Method 1 used predictors significant at the p<.25 level.
Correlational Method 2 used predictors significant at the p<.10 level.
Table 2
Correlations between Predictors and Leadership Criteria Performance Ratings
Predictor                       Calibration Sample (N=144)   Cross-Validation Sample (N=75)
SME Method                                .15*                          -.15
Correlational Method 1                    .52**                          .21*
Correlational Method 2                    .49**                          .28**
Vertical %                                .42**                          .06
Horizontal %                              .51**                          .12
Mean Criterion                            .61**                          .07
Unit Weighting                            .43**                          .05

Note: *p<.10, **p<.05. Correlational Method 1 used predictors significant at the p<.25
level. Correlational Method 2 used predictors significant at the p<.10 level.