
The Measurement Equivalence of Web-Based and Paper-and-Pencil Measures of Transformational Leadership: A Multinational Test

Michael S. Cole, University of St. Gallen
Arthur G. Bedeian, Louisiana State University
Hubert S. Feild, Auburn University

Organizational Research Methods, Volume 9, Number 3, July 2006, 339-368. © 2006 Sage Publications. DOI: 10.1177/1094428106287434. http://orm.sagepub.com, hosted at http://online.sagepub.com

Authors’ Note: We thank Hettie A. Richardson for vetting an interim draft manuscript. Please address correspondence to Michael S. Cole, Institute for Leadership and Human Resource Management, University of St. Gallen, Dufourstrasse 40a, St. Gallen, Switzerland CH-9000; e-mail: [email protected].

Multigroup confirmatory factor analysis was applied to the responses of 4,909 employees of a multinational organization with locations in 50 countries to examine the measurement equivalence of otherwise identical Web-based and paper-and-pencil versions of 20 items comprising the transformational leadership component of Bass and Avolio’s Multifactor Leadership Questionnaire. The results supported configural, metric, scalar, measurement error, and relational equivalence across administration modes, indicating that the psychometric properties of the 20 items were similar whether administered as a paper-and-pencil or Web-based measure. Although caution is always advised when considering multiple modes of administration, the results suggest that there are minimal measurement differences for well-developed, psychometrically sound instruments applied using either a paper-and-pencil or an online format. Thus, the results open a methodological door for survey researchers wishing to (a) assess transformational leadership with a Web-based platform and (b) compare or combine responses collected with paper-and-pencil and Web-based applications.

Keywords: transformational leadership; measurement invariance

More than a decade has passed since Synodinos and Brennan’s (1990) prediction that, beyond radically changing the nature of business research, interactive computerized software would become the preferred method for collecting and analyzing survey responses. The combined promise of reduced administration expenses and shortened collection-analysis-feedback cycles has undoubtedly encouraged an increasing number of organizations to adopt computer-based research techniques (Hogg, 2002; Thompson, Surface, Martin, & Sanders, 2003). Contrasting opinions regarding the efficacy of computerized (non-Web and Web-based) versus conventional paper-and-pencil surveys have been voiced for more than 30 years (Buchanan, 2002; Evan & Miller, 1969; Gosling, Vazire, Srivastava, & John, 2004).




Our intent is not to revisit the debate surrounding use of computer-based surveys but to empirically address a broader issue that is a precondition for the meaningful comparison (or aggregation) of survey responses elicited using paper-and-pencil and Web-based measures. Moving beyond basic considerations associated with measurement reliability and validity, we explore the measurement equivalence of paper-and-pencil and Web-based versions of an established measure of transformational leadership. In doing so, we place transformational leadership within a nomological network composed of three theoretically related workplace constructs (viz., collective efficacy, work-group cohesiveness, and collective goal commitment). Questions associated with measurement equivalence (or, alternatively, measurement invariance) include whether items asked on paper-and-pencil and Web-based versions of a measure are conceptually equivalent, whether paper-and-pencil and Web-based operationalizations of underlying theoretical constructs yield equivalent associations, and whether responses collected using paper-and-pencil and Web-based measures are subject to the same forms of nonsystematic measurement error.

Previous Research

Whenever a survey measure is converted for use online, it is important for scientific inference, not to mention being an ethical responsibility, to have evidence of measurement equivalence (ethical principles 9.02b and 9.05, American Psychological Association, 2002; see also Naglieri et al., 2004). Buchanan, Johnson, and Goldberg (2005) recently remarked that “it is clear that one cannot simply mount an existing [measure] on the World Wide Web and assume that it will be exactly the same instrument” (p. 125). Moreover, lack of evidence of measurement invariance weakens conclusions because findings are open to alternative interpretations (Steenkamp & Baumgartner, 1998).

As was the case in the present study, most Web-based instruments are simple translations of survey measures developed using a more traditional paper-and-pencil format. Based on their review of the literature, Buchanan and colleagues concluded that although such Web-based surveys can be reliable and valid, there is ample empirical evidence to suggest that their psychometric properties cannot be taken for granted (e.g., Buchanan, 2002; Buchanan & Smith, 1999). Fouladi, McCarthy, and Moller (2002) reported, for example, that a slight format modification adversely affected the psychometric properties of a Web-based version of two established individual-difference measures. Likewise, in a study of the viability of a Web-based personality inventory based on the Big Five taxonomy, Buchanan et al. (2005) found some items loaded on the “wrong” factors or had unacceptable cross-factor loadings. They concluded that because factorial validities of multidimensional measures are prone to subtle changes when administered online, not all Web-based and paper-and-pencil versions of the same measure will be equivalent.

Other evidence suggests, however, that mode of administration may not adversely affect the comparability of responses to organizationally relevant job attitude measures. Potosky and Bobko (1997), for instance, using a sample of undergraduate students (n = 176) as participants in a simulated personnel-selection process, reported cross-mode correlations (respondents completed both paper-and-pencil and computer-based versions of each measure) of 1.0 when corrected for unreliability. A repeated-measures comparison of mean respondent scores also indicated there were no significant differences across administration modes. Stanton (1998) investigated the comparability of procedural and distributive justice items using two employee samples. Sample 1 (n = 50) completed a Web-based survey, whereas Sample 2 (n = 181) completed a paper-and-pencil instrument. He investigated both the measurement equivalence of the measures across modes and the independent samples’ mean differences. Although use of a covariance analysis framework is generally not recommended for very small samples, Stanton found that the measures’ factor structures, item loadings, correlations between factors, and latent factor variances were equivalent across administration modes. Moreover, he found that respondents who completed the paper-and-pencil survey reported higher levels of fairness.

Similarly, Donovan, Drasgow, and Probst (2000) used an item-response theory (IRT) technique to examine the measurement equivalence of two dimensions of the Job Descriptive Index (viz., supervisor and coworker satisfaction) across three employee samples. Two samples (n = 1,777) were administered a paper-and-pencil instrument, and the third (n = 509) was administered a computer-based version of the same instrument. Results were mixed with regard to levels of measurement equivalence across the paper-and-pencil and computer administrations. Donovan et al. acknowledged, however, that the university employees who received the computerized instrument were distinctly different from the blue-collar employees who completed the paper-and-pencil instrument, and therefore, sample characteristics were confounded with modes of administration. In light of this limitation, Donovan et al. recommended that researchers obtain multiple samples from the same organization to permit an unconfounded assessment of measurement equivalence.

Considered collectively, research on the measurement equivalence of responses collected using different administration modes has produced mixed findings and is limited in quantity. For these reasons, Simsek and Veiga (2001) and, more recently, Thompson et al. (2003) have concluded that research exploring measurement invariance across paper-and-pencil and Web-based surveys remains in its infancy. Moreover, as will be discussed, the existing research is methodologically suspect.

Study Purpose

Based on a review of the methodological challenges facing researchers, Stanton and Rogelberg (2001) resolved that research is still needed to fully understand how different methods for presenting research stimuli influence survey responses. Despite continued calls to address the issue of measurement invariance (Millsap & Kwok, 2004; Vandenberg, 2002), organizational researchers have seemingly assumed that paper-and-pencil and Web-based surveys exhibit adequate cross-mode equivalence. As others have noted, however, a failure to demonstrate measurement equivalence across administration modes not only casts doubt on their comparability but also may make their unambiguous interpretation impossible (Vandenberg & Lance, 2000). Therefore, results from earlier studies, whether as simple as comparing mean scores across administration modes or as complex as comparing or testing measurement models based on data from paper-and-pencil or Web-based surveys, may be at best equivocal (Riordan & Vandenberg, 1994) and at worst erroneous (Cheung & Rensvold, 2002; Steenkamp & Baumgartner, 1998). Others have noted this concern and similarly argued that measurement equivalence of Web-based versions of paper-and-pencil surveys cannot be assumed but must be empirically demonstrated (Booth-Kewley, Edwards, & Rosenfeld, 1992; Buchanan & Smith, 1999; Cohen, Swerdlik, & Phillips, 1996).

The importance of measurement equivalence is underscored when it is realized that if, for example, a measure’s factor structure and pattern of factor loadings were found to vary across paper-and-pencil and Web-based applications (and, thus, perhaps be as much a function of a measure’s administration mode as of its content), as noted above, there would be no scientific basis for drawing meaningful inferences or establishing the generalizability of competing theories. Thus, the general question of measurement invariance must be addressed when comparing or aggregating data collected using Web-based and paper-and-pencil surveys, when selecting one administration mode over another, or when switching between modes. Moreover, in that measurement equivalence highlights the question of factorial validity, it is important to examine whether administration mode affects structural relations among theoretical constructs comprising a nomological network, or what Drasgow (1984) has dubbed “relational equivalence.” In extending Drasgow’s logic, if a measure provides equivalent measurement across administration modes, “it is natural to ask whether [a measure’s] scores have comparable relations with other important variables” (p. 134).

Beyond the preceding considerations, as noted by Ryan, Chan, Ployhart, and Slade (1999), there has been little research directed toward the measurement of job attitudes in multinational organizations. The purpose of the present investigation was therefore to examine measurement equivalence issues and to do so within the context of an international study involving employees representing 50 countries. A complete understanding of measurement issues associated with multinational research and further advancement of management as an academic discipline require that the psychometric properties of instruments developed in one country be evaluated in other countries to establish that they are cross-nationally invariant. As Steenkamp and Baumgartner (1998) observed, cross-national differences in measurement properties may, for example, be due to actual differences among respondents from different countries on an underlying theoretical construct or “may be due to systematic biases in the way people from different countries respond to certain items” (p. 78).

Guided by the prevailing assumptions that transformational leadership will (a) continue to exist as a focal research topic, (b) be of growing interest to international researchers, and (c) be increasingly assessed using Web-based methods, we surmised that an overdue contribution to the measurement literature would be to empirically test whether the most frequently used measure of transformational leadership, taken from Bass and Avolio’s (2000) Multifactor Leadership Questionnaire (TL-MLQ), exhibited comparable psychometric properties across paper-and-pencil and Web-based surveys. Moreover, as an extension of previous measurement invariance research, we sought to examine whether administration mode affects whether TL-MLQ scores have equivalent relations with three theoretically related workplace constructs (viz., collective efficacy, work-group cohesiveness, and collective goal commitment). Evidence that (a) commonly accepted measures of theoretical constructs are cross-nationally equivalent and (b) the structural relations among these constructs as elements comprising transformational leadership’s nomological network are comparable across divisions of a multinational organization would support use of either paper-and-pencil or Web-based versions of Bass and Avolio’s transformational leadership measure in international research.

To accomplish our stated purpose, we evaluated the measurement equivalence of otherwise identical Web-based and paper-and-pencil versions of Bass and Avolio’s (2000) transformational leadership measure, as well as established measures of three work-related constructs (identified above), within a multigroup confirmatory factor analysis (MGCFA) framework. In doing so, we sought to establish their measurement equivalence within a hypothesized nomological network composed of transformational leadership (Bass & Avolio, 2000) and collective efficacy, work-group cohesiveness, and collective goal commitment (Zaccaro, Rittman, & Marks, 2001).

MGCFA

MGCFA is a powerful and versatile approach for testing measurement equivalence across applications (Cheung & Rensvold, 2002; Meade & Lautenschlager, 2004b; Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). As such, it offers several advantages (Vandenberg, 2002). First, the basic CFA framework accounts for measurement error through the inclusion of error terms. Second, by varying constraints across a series of nested models, MGCFA permits a direct examination of measurement equivalence. Thus, by using MGCFA, we were able to specify constraints in an a priori manner, which allowed us to test for measurement equivalence across administration modes. Finally, MGCFA equivalence analyses, compared to their IRT counterpart, test across groups simultaneously and therefore are preferred when multiple groups are being compared (as was the case in several of our analyses) and sample sizes are small (Meade & Lautenschlager, 2004a).

An in-depth discussion of the recommended procedures for testing measurement equivalence is available in Vandenberg and Lance (2000) and Vandenberg (2002). In brief, when testing for measurement equivalence across applications (or, as in the present instance, administration modes), sets of parameters are constrained in a logically ordered, increasingly restrictive fashion. Little (1997) categorized these measurement equivalence tests into two types. Measurement-level tests assess (a) configural, (b) metric, (c) scalar, and (d) measurement-error equivalence across applications. Configural equivalence evaluates whether the conceptual frames of reference used by respondents exposed to different applications are comparable and is operationalized by testing for similarity in the pattern of factor loadings across applications. It is generally considered the weakest form of equivalence. Metric invariance assesses whether factor loadings for like items are equal across applications. Some level of metric invariance must be evident for subsequent tests of measurement equivalence to be interpretable. Scalar (or intercept) equivalence involves testing whether the vector of intercepts for like items is invariant across applications. Scalar equivalence suggests that like items have the same operational definition across applications (Cheung & Rensvold, 2002). Finally, measurement-error equivalence assesses whether, across applications, like items tap the same underlying factors with a similar degree of measurement error.


Latent-construct tests assess construct-level equivalence and involve testing for invariant construct variances/covariances, as well as between-application differences in latent means (for an in-depth discussion of mean and covariance structures analysis, see Ployhart & Oswald, 2004). Whereas some methodologists have suggested that latent-construct-level tests should be conducted only in the presence of complete support for configural, metric, and scalar equivalence (e.g., Cheung & Rensvold, 2002), others have advocated relaxing certain measurement-level constraints and testing latent means under conditions of partial scalar equivalence (for a discussion of partial invariance, see Byrne, Shavelson, & Muthén, 1989; Cheung & Rensvold, 1999; Millsap & Kwok, 2004; Steenkamp & Baumgartner, 1998).

To guide the present study, we followed the sequence of test procedures recommended by Vandenberg and Lance (2000; Vandenberg, 2002). Specifically, we compared the number of underlying factors and the equality of the item-factor loadings for measures of transformational leadership, collective efficacy, work-group cohesiveness, and collective goal commitment administered in both traditional paper-and-pencil and Web-based versions. Where both configural and metric equivalence were supported across administration modes, we tested for scalar equivalence. Finding scalar equivalence across applications, we then tested for equality of residuals to establish measurement-error equivalence. With item uniqueness supported across applications, we then tested for relational equivalence by exploring the equality of the relevant latent-construct variances and covariances. Finally, latent-mean comparisons were conducted to test for complete measurement equivalence.
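To make the cumulative logic of this sequence concrete, the sketch below lays out the constraint sets and stopping rule just described. It is ours, not the authors’ analysis code (the article cites AMOS-oriented sources but never names its software); fit_constrained_model is a hypothetical callback standing in for whatever SEM package actually fits the two-group model and returns its CFI.

```python
# Illustrative only: the phased equivalence sequence of Vandenberg and
# Lance (2000) as a series of cumulative cross-group constraints.
PHASES = [
    ("Phase 1: configural",        "same pattern of fixed and free loadings"),
    ("Phase 2: metric",            "equal factor loadings"),
    ("Phase 3: scalar",            "equal item intercepts"),
    ("Phase 4: measurement error", "equal residual variances"),
    ("Phase 5: latent variances",  "equal construct variances"),
    ("Phase 6: covariances",       "equal construct covariances"),
]

def equivalence_sequence(fit_constrained_model):
    constraints, baseline_cfi = [], None
    for phase, constraint in PHASES:
        constraints.append(constraint)            # constraints accumulate
        cfi = fit_constrained_model(constraints)  # hypothetical SEM call
        if baseline_cfi is None:
            baseline_cfi = cfi                    # Phase 1 model is the benchmark
        elif cfi - baseline_cfi < -0.01:          # CFI drop > .01 -> nonequivalence
            print(f"{phase}: rejected; stop here")
            return
        print(f"{phase}: retained (CFI = {cfi:.3f})")
    print("Phases 1-6 retained -> Phase 7: compare latent means")
```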

Method

Research Setting

Targeted respondents were employed by a company involved in the manufacture, renovation, and service of power generators and equipment. In total, 8,598 employees located in 50 countries were invited to participate in the present study. Given that the employees varied in their ability to comprehend English, all study measures were translated into 16 additional languages by a professional translation service. Linguists followed a blind back-translation strategy. That is, one linguist translated each measure from English into a second language; then, another linguist back-translated this second language into English. The two linguists discussed any discrepancies and modified the measures accordingly.

The company was adamant in its desire that every employee receive an invitation to participate in the study. To ensure that all employees had equal study access, both Web-based and paper-and-pencil surveys incorporating the focal measures were developed. Every employee with a company e-mail address (n = 7,319) received an electronic message from the company chief executive officer (CEO) describing the study and providing a link to a portal (hosted by the first author’s university) where a Web-based version of the survey could be completed in any of 17 languages. In addition, 1,279 employees who did not have a company e-mail address or computer access were sent an identical paper-and-pencil survey using the company’s internal mail service. In all, 4,909 employees voluntarily participated in the study by completing either a Web-based or paper-and-pencil survey, yielding a 57% response rate.

Participants

Web-based sample. In the online sample, 4,244 employees completed surveys, for a 58% response rate. The sample consisted of 12% upper-level managers, 29% line managers and team leaders, and 59% nonmanagerial employees. The majority (75%) were between the ages of 30 and 55, and more than half (66%) had worked for the company for more than 5 years. Age ranges, organization tenure, job functions, job titles, and survey language percentages are shown in Table 1.

Paper-and-pencil sample. In the paper-and-pencil sample, 665 employees completed surveys, for a 52% response rate. The paper-and-pencil sample included 4% upper-level managers, 8% line managers and team leaders, and 88% nonmanagerial employees. The majority (73%) were between the ages of 30 and 55, and more than half (74%) had worked for the company for more than 5 years. Age ranges, organization tenure, job functions, job titles, and survey language percentages are shown in Table 1.

Although the two samples varied somewhat with regard to demographic characteristics, such differences were anticipated because of the company’s multinational structure and geographically dispersed workforce. As one would expect, a plurality of upper-level managers were located in the company’s state-of-the-art home office, whereas a majority of its manufacturing employees were spread worldwide in facilities with varying electronic access. It is this difference in access that created the conditions necessary for the present study to be conducted. There is no theoretical reason, however, to believe that such differences would be systematically related to the focal variables of interest (described below) and thus affect relationships among study variables.

Measures

Transformational leadership. Respondents assessed their direct supervisor’s leadership behavior using the 20 transformational leadership items of the MLQ-5x-short (Bass & Avolio, 2000), the most frequently used measure of transformational leadership (Antonakis, Avolio, & Sivasubramaniam, 2003). The MLQ-5x-short assesses four dimensions of transformational leadership: idealized influence (attributed and behavior), inspirational motivation, intellectual stimulation, and individualized consideration. Using a 5-point response format ranging from 1 = not at all to 5 = frequently, if not always, respondents judged how frequently their supervisors displayed specific leader behaviors. Coefficient alpha was .96 for both the paper-and-pencil and Web-based versions.
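As a point of reference for the reliabilities reported in this section, coefficient alpha is computed from the item variances and the variance of the summed scale; a minimal sketch (ours, for illustration only):

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents x k_items) matrix of item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    sum_item_var = x.var(axis=0, ddof=1).sum()  # sum of the k item variances
    total_var = x.sum(axis=1).var(ddof=1)       # variance of the summed scale
    return (k / (k - 1)) * (1 - sum_item_var / total_var)
```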

Collective efficacy. Six items from Riggs and Knight’s (1994) collective-efficacy beliefs measure were used to substantiate respondents’ individual perceptions of their work group’s competence for successful action. Sample items include “People in my work group have above average ability” and “My work group is not able to perform as well as it should” (reverse scored). Respondents indicated their agreement with these items using a 5-point response format (1 = strongly disagree, 5 = strongly agree). The paper-and-pencil version alpha was .74; alpha for the Web-based version was .77.

Table 1
Demographic Information for Web-Based and Paper-and-Pencil Samples

Variable                            Web-Based Sample    Paper-and-Pencil Sample
                                    (n = 4,244)         (n = 665)
Response rate (%)                   58.0                52.0

Age in years (%)
  <25                               2.2                 3.2
  25-30                             10.7                11.1
  31-35                             15.8                9.2
  36-40                             15.6                17.7
  41-45                             14.0                15.6
  46-50                             15.0                18.5
  51-55                             14.1                12.2
  56-60                             7.3                 4.7
  >60                               2.9                 1.5
  No answer                         2.5                 6.3

Organization tenure in years (%)
  <1                                3.9                 3.0
  <5                                28.5                16.5
  5-10                              19.6                22.3
  11-15                             13.1                20.2
  16-20                             7.0                 8.3
  21-25                             9.4                 6.0
  26-30                             7.8                 7.8
  31-35                             5.5                 6.6
  >35                               3.3                 2.9
  No answer                         1.8                 6.5

Job function (%)
  Field service                     16.6                9.5
  Workshops                         4.8                 51.6
  Maintenance                       8.5                 9.6
  Engineering                       22.9                3.2
  Sales                             11.4                3.9
  Parts and logistics               6.1                 3.6
  Support functions                 22.3                5.1
  Shared services                   4.0                 0.6
  Other                             3.5                 12.9

Job title (%)
  Sector/region/unit management     11.8                3.9
  First line manager/team leader    29.4                7.5
  Employee                          58.9                88.6

Survey language (%)
  Croatian                          4.2                 35.9
  Czech                             1.5                 5.9
  Danish                            1.8                 10.7
  English                           39.0                10.5
  Estonian                          0.2                 0.2
  Finnish                           0.9                 1.2
  Flemish                           —                   1.8
  French                            13.4                4.8
  German                            15.6                3.2
  Hungarian                         2.3                 —
  Italian                           3.7                 5.6
  Japanese                          0.1                 —
  Polish                            3.0                 13.2
  Portuguese                        2.4                 —
  Romanian                          2.8                 —
  Spanish                           3.8                 1.7
  Swedish                           5.2                 5.4


Work-group cohesiveness. Six items adapted from Riordan and Weatherly (1999) were used to gauge employees’ work-group cohesiveness. The items assess the extent to which a work group’s members agree or disagree that they are willing to work together and are committed to completing assigned tasks. Sample items include “In my work group, members know that they can depend on one another” and “In my work group, members take interest in one another.” Respondents indicated their agreement with these items using a 5-point response format (1 = strongly disagree, 5 = strongly agree). Alphas for the paper-and-pencil and Web-based versions, respectively, were .92 and .93.

Collective goal commitment. Six items from Hollenbeck, Klein, O’Leary, and Wright (1989) were included to gauge respondents’ goal commitment. The items were slightly modified from their original form. Whereas Hollenbeck et al.’s original items referred to the individual (e.g., “I am strongly committed to pursuing this goal”), we used a referent shift (Chan, 1998) so that items referred to the collective (“In my work group, people are strongly committed to pursuing our goals”). Respondents indicated their agreement with these items using a 5-point response format (1 = strongly disagree, 5 = strongly agree). Alpha for the paper-and-pencil version was .85; alpha for the Web-based version was .86.

Procedure

Respondents completed all four study measures in the same sequence across both administration modes. The Web-based survey was pilot tested to identify potential incompatibilities among operating systems, Web browsers, and so on. After the pilot test was completed, a date was set for survey administration. An electronic message was sent from the CEO to all employees who had a company e-mail account. The message contained a brief salutation, a short description of the company’s interest in the research project, the deadline for completing and returning the survey (11 days later), and a link to the Web portal where the survey was posted. The message also explained that participation was voluntary and anonymity was guaranteed. Because this was the first time employees were invited by the company to complete a survey online, a description of how data were transmitted and stored on a third-party computer server (affiliated with the university conducting the study) was also provided. One week later, a reminder was sent from the CEO’s e-mail account. This reminder thanked everyone who had already participated, provided the Web location where the survey was posted, and noted that the survey deadline was approaching.

The company was accustomed to communicating in various languages and had systems in place for doing so (e.g., employees’ e-mail addresses were grouped by country of residence). This capability allowed the CEO’s invitation and reminder messages to be specifically tailored to the predominant language(s) spoken in each home country. Whereas employees located in the United States received e-mail in English, employees in Switzerland, for example, received e-mail in German, French, Italian, and English.

The paper-and-pencil surveys were delivered via internal company mail. In most locations, several versions of the survey (e.g., the English version and one to three other languages) were made available so that employees could choose the language in which they were most comfortable conversing. The company announced the survey through internal memoranda, and local managers were encouraged to ensure that all employees were aware of the survey’s purpose and availability. With regard to the format of the paper-and-pencil survey, the software used to develop the Web-based survey also allowed us to generate a portable document format (pdf) file that printed a look-alike copy. When completing a traditional paper-and-pencil survey, respondents can preview, review, skip, and even change answers if desired. Likewise, in the present study, respondents who completed the Web-based survey were also allowed to preview, review, skip, and change their answers. The look-alike appearance and identical operational features were specifically intended to control for confounding effects due to differences in administrative procedures.

To complete the study, respondents to the Web survey were instructed to click a button labeled Submit. Respondents completing paper-and-pencil surveys were instructed to mail them to the university address of the primary author.

Analyses

For each of the aforementioned measures, tests of measurement equivalence were conducted in seven sequential phases. In each phase, we employed MGCFA to compare the covariance matrices of individual measure items. Missing data accounted for roughly 3% to 4% of the total number of item (n = 38) responses. Congruent with Stanton (1998), the average number of missing items per paper-and-pencil respondent (M = 1.83) was slightly higher than that of Web respondents (M = 1.13). To handle missing values, we used the full-information maximum likelihood (FIML) technique. FIML is superior to other imputation techniques as it gives unbiased estimates of means, variances, and other parameters (Arbuckle, 1996; Byrne, 2001; Wothke, 2000).
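Although the article does not reproduce it, the FIML objective is standard (Arbuckle, 1996): each case contributes a likelihood term over only the variables actually observed for that case, so no values are filled in or discarded. In LaTeX notation,

\log L(\mu, \Sigma) = \sum_{i=1}^{N} \Big[ -\tfrac{k_i}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert\Sigma_i\rvert - \tfrac{1}{2}(x_i - \mu_i)^{\top}\Sigma_i^{-1}(x_i - \mu_i) \Big],

where x_i is the k_i-vector of variables observed for case i, and \mu_i and \Sigma_i are the corresponding subvector and submatrix of the model-implied mean vector \mu and covariance matrix \Sigma.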

As with standard confirmatory factor analysis, overall model fit in MGCFA is commonly evaluated for statistical significance, with a nonsignificant chi-square indicating a good fit. Because of a functional dependence on sample size, for large samples the chi-square statistic provides a sensitive statistical test but not a practical test of model fit. Consequently, a second approach is to use practical fit indices as an alternative to chi-square (Hu & Bentler, 1998). Accordingly, we employed three additional indices for assessing overall model fit: (a) the comparative fit index (CFI), which compares how much better an implied model fits as compared to a null model; (b) the Tucker-Lewis index (TLI), which contains a penalty for lack of parsimony; and (c) the root mean square error of approximation (RMSEA), which adjusts for both sample size and degrees of freedom. With regard to the RMSEA, we also computed 90% confidence intervals (LO90 and HI90) as recommended by Byrne (2001).
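For readers who want these indices in computable form, the standard formulas are sketched below (our code, not the authors’; chi2_0 and df_0 denote the independence/null-model values that SEM software reports alongside the target model):

```python
import math

def cfi(chi2_m, df_m, chi2_0, df_0):
    """Comparative fit index: improvement of the implied model over the null model."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_0 - df_0, chi2_m - df_m, 0.0)
    return 1.0 - num / den

def tli(chi2_m, df_m, chi2_0, df_0):
    """Tucker-Lewis index: the chi-square/df ratios build in a parsimony penalty."""
    return ((chi2_0 / df_0) - (chi2_m / df_m)) / ((chi2_0 / df_0) - 1.0)

def rmsea(chi2_m, df_m, n):
    """Root mean square error of approximation: adjusts for sample size and df."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))
```

As a rough check against Table 3, rmsea(5691.97, 339, 4909) returns approximately .057, the value reported for the transformational leadership measurement model under the combined-sample N convention.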

In testing measurement equivalence, we were interested in making cross-mode comparisons (i.e., paper-and-pencil vs. Web-based) with regard to the equality of estimated parameters of two nested models. This procedure involves testing the fit of a series of increasingly restrictive models against a baseline model. To determine the degree of equivalence, previous research has employed the likelihood ratio test (also known as the chi-square difference test). Chi-square differences between nested models are, however, sample-size dependent (Brannick, 1995). Similarly, the chi-square statistic is known to be an overly sensitive index of model fit for models with numerous constrained parameters fitted on large samples (Marsh, Balla, & McDonald, 1988). Nonetheless, because no standard exists for comparing changes in practical fit indices when constraints are added (Vandenberg & Lance, 2000), researchers using MGCFA have had to rely on the likelihood ratio test to assess whether a constrained model fits obtained data less well than a less constrained model does. To address this limitation, Cheung and Rensvold (2002) conducted a Monte Carlo simulation to assess differences in practical goodness-of-fit indices under the null hypothesis of measurement equivalence. Their findings suggested most changes in practical goodness-of-fit indices (∆GFIs) are dependent on both sample size and model complexity and are also correlated with overall fit measures. The ∆CFI metric, however, was found to be a robust fit statistic when testing MGCFA models. Furthermore, they found that a decrease in CFI of .01 or less (∆CFI ≥ −0.01) indicates that a null hypothesis of equivalence should not be rejected. Thus, we tested our equivalence hypotheses using this critical value.
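The two decision rules discussed here, the sample-size-dependent likelihood ratio test and Cheung and Rensvold’s ∆CFI cutoff, reduce to a few lines (our sketch):

```python
from scipy.stats import chi2 as chi2_dist

def chi_square_difference(chi2_constrained, df_constrained, chi2_free, df_free):
    """Likelihood ratio (chi-square difference) test for nested models."""
    d_chi2 = chi2_constrained - chi2_free
    d_df = df_constrained - df_free
    return d_chi2, d_df, chi2_dist.sf(d_chi2, d_df)  # upper-tail p value

def retain_equivalence(cfi_constrained, cfi_baseline):
    """Cheung and Rensvold (2002): retain the equivalence hypothesis unless
    the CFI drops by more than .01 relative to the baseline model."""
    return (cfi_constrained - cfi_baseline) > -0.01
```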

Phase 1 tested for configural equivalence to determine if the patterns of fixed and free factor loadings were comparable across the paper-and-pencil and Web-based surveys (Vandenberg, 2002). Given that configural equivalence exists, the same number of underlying factors would emerge. Differing factor structures across administration modes would be evidence that respondents to the two survey forms used different conceptual frames of reference. As previously noted, a failure to support configural equivalence argues against further tests of measurement invariance (Vandenberg & Lance, 2000). Phase 2 tested for metric invariance to determine if factor loadings for items were equal across administration modes (Vandenberg, 2002). In Phase 2, the model tested as part of Phase 1 was constrained by setting the factor loadings to be equal for both administration modes. A loss in fit, as indicated by the ∆CFI between the restricted model and the model tested in Phase 1, would support rejecting a null hypothesis of invariant factor loadings (Cheung & Rensvold, 2002).

Phase 3 examined scalar invariance to determine whether, in addition to invariant factor loadings, the vectors of item intercepts were also equal across administration modes (Vandenberg, 2002). Cheung and Rensvold (2002) have indicated that scalar equivalence is a prerequisite for the comparison of latent means because it suggests that underlying item-response scales have the same intervals and zero points. Accordingly, they have contended that if scalar equivalence is not supported, any subsequent mean comparisons remain ambiguous. As noted, however, others have argued that latent-mean comparisons can still be made under partial scalar equivalence (e.g., Byrne et al., 1989; Millsap & Kwok, 2004; Steenkamp & Baumgartner, 1998). Following this latter rationale, if at least partial intercept equivalence were present, we compared latent means across administration modes.

If the increasingly restrictive models in Phases 2 and 3 showed no deterioration in fit (i.e., ∆CFI) when compared to the baseline model tested in Phase 1, we moved to Phase 4. In this phase, we tested the null hypothesis that like items’ residual error variances would be equivalent across administration modes. Therefore, in Phase 4, the factor structures, factor loading parameters, intercepts, and error variances were constrained to be equal for both modes. Because residual variance cannot be attributed to the variance associated with an underlying latent variable, equality of error variances across administration modes determines if survey items tap their associated latent variables with the same degree of measurement error (Cheung & Rensvold, 2002). Unfamiliarity with scoring formats; differences in vocabulary, idioms, and grammar; and reactivity resulting from different administration modes have all been suggested as sources of residual nonequivalence (Cheung & Rensvold, 2002; Potosky & Bobko, 1997; Stanton, 1998). Consistent with the earlier phases, a decrease in the CFI greater than .01 (∆CFI < −0.01) was taken as evidence that a null hypothesis of invariant residual variances should be rejected.

The three remaining data-analysis phases involved testing for construct-level equivalence. Phases 5 and 6 addressed relational equivalence for all four study measures. Cheung and Rensvold (2002) noted that when one’s purpose is to relate a theoretical construct (in our case, transformational leadership) to other constructs (viz., collective efficacy, work-group cohesiveness, and collective goal commitment) in a nomological network, latent-construct variance equivalence must be present. Given support for construct-variance equivalence (i.e., Phase 5), we tested for equality between network correlations in Phase 6 by adding an additional constraint of equal covariances. The finding that both construct variances and covariances are invariant across administration modes indicates that the correlations between latent constructs are equal for both modes (e.g., Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). Steenkamp and Baumgartner (1998) have explained, however, that when measures of association between variables are compared across groups, measurement reliabilities should “be about the same so that measurement artifacts do not bias the substantive conclusions” (p. 82). Measurement-error equivalence is an indicator of a measure’s reliability (Vandenberg & Lance, 2000), and therefore, the equivalence procedure tested in Phase 4 was a prerequisite for testing the relational equivalence of all four study measures.

Finally, in Phase 7, we tested the null hypothesis of equivalent latent means across administration modes. It has been suggested that latent-mean comparisons are more valid than simple analysis of variance (ANOVA) or t tests because, given the equivalence established in the prior phases, observed mean differences cannot be attributed to a lack of equivalence (e.g., Ployhart & Oswald, 2004; Vandenberg & Lance, 2000). In the present study, the procedure outlined by Byrne (2001) for latent-mean comparisons was followed. As part of this procedure, it was necessary to allow latent-variable means to be freely estimated for one administration mode but to be constrained equal to zero for the other, which thus served as a “reference group.” Determination of which mode would serve as the reference group is purely arbitrary (Byrne, 2001). We chose the paper-and-pencil mode to serve this purpose. Thus, the latent means for the Web-based mode were estimated. Significantly higher mean values would indicate that the Web-based mode reported higher mean ratings on a given measure. In determining significant differences, the critical ratio follows an approximately normal distribution, with values greater than ±1.96 indicating statistical significance (p ≤ .05; Arbuckle & Wothke, 1999).
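In code, the significance test for a freely estimated latent mean against the zero-fixed reference group is simply a critical ratio (our sketch; the estimate and its standard error come from the fitted model):

```python
from scipy.stats import norm

def latent_mean_test(estimate: float, std_error: float):
    """Critical ratio and two-tailed p value for a latent-mean difference.

    With the paper-and-pencil mean fixed at zero, `estimate` is directly the
    Web-based minus paper-and-pencil difference; |CR| > 1.96 -> p <= .05.
    """
    cr = estimate / std_error
    return cr, 2 * norm.sf(abs(cr))
```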

Results

In addition to skewness, kurtosis, and coefficient alphas for the observed variables, the intercorrelations among all study variables for the Web-based and paper-and-pencil samples are reported in Table 2. Of note are the variables’ internal consistency estimates and the associations between transformational leadership and the three outcomes. The coefficient alphas and intercorrelations among the focal variables appear similar across administration modes.

Table 2
Skewness, Kurtosis, Reliabilities, and Intercorrelations for All Study Variables

                                   Web-Based Mode          Paper-and-Pencil Mode
Measure                          Skew.  Kurt.   α        Skew.  Kurt.   α        1      2      3      4
1. Transformational leadership   –.45   –.30   .96       –.39   –.53   .96       —    .37**  .39**  .45**
2. Collective efficacy           –.30    .08   .77       –.38    .02   .74     .30**    —    .68**  .67**
3. Work-group cohesiveness       –.71    .30   .93       –.70    .12   .92     .42**  .70**    —    .64**
4. Collective goal commitment    –.69    .67   .86       –.52    .02   .85     .41**  .69**  .71**    —

Note: Correlations above the diagonal are for the Web-based mode (ns = 4,117 to 4,164), whereas those below the diagonal are for the paper-and-pencil mode (ns = 633 to 648).
**p < .01.

The Factor Structure of Bass and Avolio’s Transformational Leadership Measure

Within each subsample (i.e., Web-based and paper-and-pencil), we first investigated the factor structure of transformational leadership by comparing two measurement models. In previous studies, intercorrelations among transformational dimensions were generally high and positive (Lowe, Kroeck, & Sivasubramaniam, 1996). In such situations, whether these dimensions are empirically distinguishable has been questioned (e.g., Bycio, Hackett, & Allen, 1995; Carless, 1998). Accordingly, the first two models fitted the Web-based data (Model 1a) and paper-and-pencil data (Model 1b) with all 20 transformational items loading on a global transformational leadership construct. Results indicated that the global leadership factor was an acceptable fit to both the Web-based (χ² = 6126.11, df = 170, p < .001, CFI = .975, TLI = .969, RMSEA = .091, LO90 = .089 and HI90 = .093) and the paper-and-pencil data (χ² = 904.04, df = 170, p < .001, CFI = .978, TLI = .973, RMSEA = .081, LO90 = .076 and HI90 = .086).

In Models 2a and 2b, we fitted a single higher order model to the data (see Figure 1). In his original validation work, Bass (1985) discussed the importance of considering higher order models. Building from this base, Avolio, Bass, and Jung (1999) reported that a second-order model successfully explained the covariation among first-order factors representing the transformational dimensions. In support, others have reported results revealing a single, higher order transformational leadership factor (e.g., Carless, 1998; Den Hartog, Van Muijen, & Koopman, 1997). As shown in Figure 1, items in Models 2a and 2b were fixed to load on their respective factors (individualized consideration, intellectual stimulation, inspirational motivation, and idealized influence), and a higher order transformational leadership factor was also modeled to explain the covariation among the latent first-order factors. Results indicated that the higher order model fit the paper-and-pencil data (Model 2b: χ² = 792.82, df = 166, p < .001, CFI = .981, TLI = .976, RMSEA = .075, LO90 = .070 and HI90 = .081); however, in the Web-based sample, an improper solution, resulting from a negative residual estimate, was obtained. Such improper solutions are frequently encountered in practice, and


Figure 1
Second-Order Measurement Model for Transformational Leadership

[Figure: the 20 observed items (v1-v20), each with a residual error term (e), load on four first-order factors (individualized consideration, intellectual stimulation, inspirational motivation, and idealized influence), which in turn load on a single second-order transformational leadership factor; each first-order factor carries a disturbance term (D).]

Note: D = disturbance; v = variable; e = residual error term.
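For readers who want to reproduce the Figure 1 structure, a lavaan-style specification accepted by the Python package semopy is sketched below. The item names (ii1-ii8, im1-im4, is1-is4, ic1-ic4) are hypothetical placeholders for the 20 MLQ item columns in one’s own data file, with idealized influence combining its attributed and behavior items; the article does not name its SEM software, and nothing here should be read as the authors’ code.

```python
import semopy

# Second-order measurement model from Figure 1 (illustrative item names):
# four first-order factors, each explained by a single second-order
# transformational leadership factor.
MODEL_DESC = """
IdealizedInfluence =~ ii1 + ii2 + ii3 + ii4 + ii5 + ii6 + ii7 + ii8
InspirationalMotivation =~ im1 + im2 + im3 + im4
IntellectualStimulation =~ is1 + is2 + is3 + is4
IndividualizedConsideration =~ ic1 + ic2 + ic3 + ic4
TransformationalLeadership =~ IdealizedInfluence + InspirationalMotivation + IntellectualStimulation + IndividualizedConsideration
"""

model = semopy.Model(MODEL_DESC)
# model.fit(items_df)        # items_df: DataFrame with the 20 item columns
# semopy.calc_stats(model)   # chi-square, df, CFI, TLI, RMSEA, ...
```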


numerous approaches have been suggested for handling such offending estimates (see Chen, Bollen, Paxton, Curran, & Kirby, 2001; Dillon, Kumar, & Mulani, 1987; Lawley & Maxwell, 1971). When fitting hierarchical models, Byrne (2001) has noted the importance of inspecting the higher order portion of a model using the critical ratio difference (CRDIFF) method. Indeed, the critical ratios of differences for the residual variances were found to be, at times, statistically nonsignificant in the samples (one of six comparisons in the Web-based sample; three of six comparisons in the paper-and-pencil sample). Based on these findings, we followed Byrne’s recommendation and constrained the residual variances to be equal. We then fit the revised model (Model 2c) to our data. There was no indication of an improper solution, and the practical fit indices indicated a good fit (χ² = 4834.45, df = 169, p < .001, CFI = .981, TLI = .976, RMSEA = .081, LO90 = .079 and HI90 = .083).

With the models independently fitted to the data, we then used the chi-square difference test to determine the best-fitting measurement model for the transformational leadership measure. Compared to Model 1a, the revised Model 2c exhibited a better fit, ∆χ² = 1291.66, ∆df = 1, p < .001, and all practical fit indices improved.1 Likewise, results for the paper-and-pencil data (Model 1b vs. Model 2b) indicated that the single higher order model was a better fit, ∆χ² = 111.22, ∆df = 4, p < .001, and the practical fit indices remained virtually unchanged. Supporting earlier studies, a second-order hierarchical factor structure provided the best fit to the data.
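The first of these comparisons can be verified directly from the chi-square values quoted above (a quick check, ours):

```python
from scipy.stats import chi2

# Model 1a (global factor, Web-based data) vs. revised Model 2c (second-order)
d_chi2 = 6126.11 - 4834.45   # = 1291.66, as reported
d_df = 170 - 169             # = 1
p = chi2.sf(d_chi2, d_df)    # far below .001 -> prefer Model 2c
```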

Hypothesis Testing for Web-Based Versus Paper-and-Pencil Administration Modes

It is generally recommended that researchers independently inspect each measure’s factor structure (i.e., configural equivalence) to determine the most appropriate baseline measurement model (Byrne, 2001). We first examined configural equivalence for the higher order transformational leadership model across the administration modes. Each practical fit index was within acceptable guidelines (CFI = .980, TLI = .976, and RMSEA = .057, LO90 = .055 and HI90 = .058), supporting configural equivalence with respect to transformational leadership across administration modes (see Table 3). Independent baseline analyses were similarly conducted for collective efficacy, cohesion, and collective goal commitment. Results shown in Table 3 provide support for each measure’s configural invariance. All practical fit indices were within acceptable ranges. For collective efficacy, the CFI = .990 and TLI = .978 were high, and the RMSEA = .093 (LO90 = .087 and HI90 = .099) was slightly below the .10 guideline. Fit indices for cohesion (CFI = .995, TLI = .989, and RMSEA = .072, LO90 = .067 and HI90 = .078) and collective goal commitment (CFI = .995, TLI = .989, and RMSEA = .070, LO90 = .065 and HI90 = .076) also suggested a good fit.

Whereas the prior analyses provided initial support for each measure’s configural equivalence when examined independently, a more stringent configural equivalence test would incorporate all four measures into one measurement model and simultaneously examine the model’s fit to the Web-based and paper-and-pencil data. We therefore developed a measurement model whereby each measure’s items were fixed to load on its respective latent factor (no item cross-loadings were permitted), and the four latent factors (the higher order transformational leadership model, collective efficacy, work-group cohesiveness, and collective goal commitment) were allowed to covary. Consequently, this four-factor measurement model was used as the baseline model in the subsequent tests of measurement equivalence.

Phase 1 tested the null hypothesis of an equal number of factors across administration modes for the study’s measures. As shown in Table 3, the four-factor model was an excellent fit to the data, χ² = 10,736.54, df = 1317, p < .001, CFI = .983, TLI = .981, RMSEA = .038, LO90 = .038 and HI90 = .039. Given that the fit indices indicated acceptable fit, we failed to reject the null hypothesis of equal factor structures and therefore proceeded to Phase 2.

Phase 2 tested all four measures across administration modes for equal factor loadings. Because of the single higher order transformational leadership factor, the equality constraint was imposed on the second-order path coefficients as well. As described earlier, we used Cheung and Rensvold’s (2002) criterion (a CFI decrease of no more than .01) as an indicator that the constrained model tested in Phase 2 adequately fit the data. Fit indices are reported in Table 3. Inspecting the ∆CFI relative to the baseline model in Phase 1 indicated the CFI value did not change, despite the added constraints. Similarly, the TLI and RMSEA values were also similar to those reported in Phase 1. Taken together, these results support the conclusion that the factor loadings associated with their respective measures were invariant across the Web-based and paper-and-pencil modes. On this basis, we failed to reject the null hypothesis of equal factor loadings and proceeded to Phase 3 of our analyses.


Table 3
Goodness-of-Fit Tests for the Multigroup Confirmatory Factor Analyses

Instrument                               χ²            df     CFI    TLI   RMSEA   RMSEA 90% CI
Separate measurement models
  Transformational leadership          5,691.97***    339    .980   .976   .057    .055-.058
  Collective efficacy                    781.88***     18    .990   .978   .093    .087-.099
  Work-group cohesiveness                481.88***     18    .995   .989   .072    .067-.078
  Collective goal commitment             454.19***     18    .995   .989   .070    .065-.076
Phase 1: Configural equivalence
  Equal factor structures             10,736.54***   1317    .983   .981   .038    .038-.039
Phase 2: Metric equivalence
  Equal factor loadings               10,823.21***   1352    .983   .981   .038    .037-.038
Phase 3: Scalar equivalence
  Equal intercepts                    11,454.71***   1390    .982   .981   .038    .038-.039
Phase 4: Residual equivalence
  Equal uniquenesses                  12,092.53***   1428    .981   .980   .039    .038-.040
Phase 5: Latent-construct equivalence
  Equal latent variances              12,101.55***   1431    .981   .980   .039    .038-.040
Phase 6: Covariance equivalence
  Equal covariances                   12,136.45***   1437    .981   .980   .039    .038-.040

Note: CFI = comparative fit index; TLI = Tucker-Lewis index; RMSEA = root mean square error of approximation.
***p < .001.


In Phase 3, we tested scalar invariance across administration modes by adding the additional constraint of equal item intercepts to the already restricted model. As reported in Table 3, the CFI and TLI were .982 and .981, respectively, and the RMSEA value continued to suggest a good fit (.038). Inspecting the ∆CFI relative to the baseline model in Phase 1, the CFI change of −0.001 did not exceed the critical threshold recommended by Cheung and Rensvold (2002). Consequently, the intercepts of like items were equal across administration modes when regressed on their associated latent factors. These findings suggest that all four measures exhibited scalar invariance, a necessary precondition for unambiguous comparisons of latent means.

Although testing for invariance in measurement error is considered by some to be an overly restrictive test (Cheung & Rensvold, 1999; Little, 1997), we also applied this additional constraint to our measurement model. Thus, in Phase 4, we tested for invariance in measurement error across the items comprising each study measure. In addition to equal factor loadings and intercepts, the error variance associated with each item in a given measure was constrained to be equal across administration modes. These results are reported in Table 3. All practical fit indices indicated the Phase 4 model exhibited a good fit. Most important, the change in the CFI (∆CFI = −0.002) indicated there was no deterioration in model fit despite the increasingly restrictive equality constraints. Consequently, we failed to reject the null hypothesis of equal item residual variance. Having found that the error variances associated with individual items were equivalent across modes (in addition to scalar equivalence in Phase 3), we concluded that all four measures were reliable across both administration modes.

In Phase 5, we tested the latent construct invariance of each measure across administration modes by adding the constraint of equal construct variances to the already restricted model of invariant residual variances. As reported in Table 3, the practical fit indices suggested the model continued to provide a good fit. An inspection of the ∆CFI relative to the baseline model in Phase 1 (–0.002) demonstrated that there was no model degradation despite the additional constraint of equal construct variances. These findings suggest that the range of responses to each measure's items was comparable across Web-based and paper-and-pencil modes. Thus, construct variances were invariant across administration modes.

With equivalent construct variances established, in Phase 6, we applied the additional constraint of invariant covariances to the already highly restricted measurement model. Results are reported in Table 3. Practical fit indices indicated that there was no serious model deterioration despite the equality constraints placed on the construct covariances. Also of importance, the ∆CFI (–0.002) relative to the configural model estimated in Phase 1 did not exceed the critical value recommended by Cheung and Rensvold (2002). Considered collectively, the findings from Phases 5 and 6 demonstrated that both construct variances and covariances were invariant across administration modes. Hence, the latent correlations among the four focal constructs were also equivalent across modes. On this basis, we concluded that relational equivalence existed across the Web-based and paper-and-pencil surveys.

In sum, the present findings provided evidence for the equivalence of the overall measurement structure (Phase 1), factor loadings (Phase 2), item intercepts (Phase 3), item uniquenesses or residuals (Phase 4), factor variances (Phase 5), and factor covariances (Phase 6) across administration modes. Nevertheless, we must again note that our conclusions were based on the ∆CFI statistic as recommended by Cheung and Rensvold (2002). Although reported to be a robust index for testing the multigroup invariance of CFA models (Cheung & Rensvold, 2002), an alternative explanation for our finding of full equivalence is that the ∆CFI possesses low power as a model-comparison procedure. In the present application, however, low power did not appear to be a concern. As shown in Table 3, the CFIs and TLIs remained virtually unchanged, and the RMSEA 90% confidence intervals overlapped across models. Moreover, the standardized regression weights (factor loadings) obtained for the full equivalence model (Phase 6) were similar to the observed factor loadings for the freely estimated model (Phase 1). Furthermore, as illustrated in Table 4, the average absolute differences (Phase 6, fully constrained model, minus Phase 1, freely estimated model) between the standardized parameter estimates were small for both the Web-based measures (|0.006|) and the paper-and-pencil measures (|0.026|).
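To make the magnitude of these differences concrete, the short sketch below averages the absolute loading differences for one block of Table 4; the loadings are the first-order individualized-consideration items for the Web-based sample, and the averaging itself is our illustration of the |0.006| and |0.026| summary values.

```python
# Illustrative check of parameter stability between the Phase 1 (freely
# estimated) and Phase 6 (fully constrained) solutions, using the
# individualized-consideration loadings for the Web-based sample (Table 4).

phase1_web = [0.736, 0.652, 0.703, 0.844]
phase6_web = [0.741, 0.654, 0.698, 0.844]

mean_abs_diff = sum(abs(p6 - p1)
                    for p1, p6 in zip(phase1_web, phase6_web)) / len(phase1_web)
print(f"mean |difference| = {mean_abs_diff:.3f}")  # 0.003 -> negligible shift
```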

Because full equivalence held for all four measures, valid comparisons of latent-factor means were conducted in Phase 7. To determine whether mean differences existed, latent-mean comparisons were estimated using the Phase 5 model, with the latent means for the paper-and-pencil sample constrained to zero. As shown in Table 5, there were differences between the paper-and-pencil and Web-based latent means. In three of four cases, the Web-based sample reported significantly higher scores: transformational leadership (mean difference = 0.162, p < .01), work-group cohesiveness (mean difference = 0.093, p < .01), and collective goal commitment (mean difference = 0.083, p < .05). In contrast, the paper-and-pencil sample expressed a higher level of collective efficacy (mean difference = |0.065|, p < .01) than did the Web-based sample. On the whole, the mean differences were not large; the largest was found for transformational leadership (|0.162|, p < .01). Finally, a review of the practical fit indices indicated that the model fit the data (χ² = 12,023.2, df = 1427, p < .001; CFI = .981; TLI = .980; RMSEA = .039, 90% CI = .038-.040). Given that the practical fit indices were within acceptable guidelines, we are confident in our interpretation of the latent-mean estimates (see Byrne, 2001, p. 241).
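The critical ratios in Table 5 are simply the latent mean differences divided by their standard errors, referred to the standard normal distribution. A sketch follows; the values are from Table 5, and the small discrepancy from the published critical ratio reflects rounding of the reported mean and standard error.

```python
# Sketch of the critical-ratio test underlying Table 5. A latent mean
# difference divided by its standard error is evaluated against the
# standard normal distribution; inputs are from Table 5 (Phase 6 model).
from scipy.stats import norm

def critical_ratio(mean_diff: float, se: float):
    z = mean_diff / se
    p = 2.0 * norm.sf(abs(z))  # two-tailed p value
    return z, p

z, p = critical_ratio(mean_diff=0.162, se=0.031)  # transformational leadership
print(f"z = {z:.2f}, p = {p:.2g}")  # z of about 5.2 (Table 5 reports 5.29), p < .01
```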

For comparison purposes, we also estimated latent means within the context of a model that does not assume full equivalence (as incorrectly assuming equivalence may bias the test of latent means). When estimating latent means, the attainment of an overidentified model is possible only with the imposition of additional model constraints (see Byrne, 2001, p. 229). As a consequence, we conducted a second latent-means analysis using the scalar equivalence model. As shown at the bottom of Table 5, the latent-mean differences were nearly identical to those reported for the full equivalence model.

Confound Analyses

Because the Web-based and paper-and-pencil samples were drawn from multiple countries and surveys were presented in different languages, we assessed each measure's equivalence across countries and languages to rule out the possibility of cross-national confounds. The two samples also appeared to differ systematically with respect to job level.


Table 4
Standardized Factor Loadings Contrasting the Phase 1 Versus Phase 6 Measurement Equivalence Tests

                                        Phase 1: Configural      Phase 6: Covariance
                                        Equivalence(a)           Equivalence(b)
Instrument                              Web      Paper           Web             Paper

TL-MLQ
 Second-order factor loadings
  Individualized consideration          .938     .964            .960 (–.022)    .960 (.004)
  Intellectual stimulation              .936     .953            .941 (–.005)    .941 (.012)
  Inspirational motivation              .960     .951            .939 (.021)     .939 (.012)
  Idealized influence                   .971     .970            .971 (.000)     .971 (–.001)
 First-order factor loadings
  Individualized consideration
   Item 1                               .736     .760            .741 (–.005)    .741 (.019)
   Item 2                               .652     .648            .654 (–.002)    .654 (–.006)
   Item 3                               .703     .652            .698 (.005)     .698 (–.046)
   Item 4                               .844     .840            .844 (.000)     .844 (–.004)
  Intellectual stimulation
   Item 1                               .697     .718            .702 (–.005)    .702 (.016)
   Item 2                               .754     .754            .753 (.001)     .753 (.001)
   Item 3                               .798     .802            .799 (–.001)    .799 (.003)
   Item 4                               .818     .798            .814 (.004)     .814 (–.016)
  Inspirational motivation
   Item 1                               .674     .677            .673 (.001)     .673 (.004)
   Item 2                               .809     .723            .796 (.013)     .796 (–.073)
   Item 3                               .776     .787            .778 (–.002)    .778 (.009)
   Item 4                               .813     .771            .804 (.009)     .804 (–.003)
  Idealized influence (attributed and behavior)
   Item 1                               .826     .786            .821 (.005)     .821 (–.035)
   Item 2                               .761     .737            .759 (.002)     .759 (–.022)
   Item 3                               .835     .827            .835 (.000)     .835 (–.008)
   Item 4                               .775     .771            .773 (.002)     .773 (–.002)
   Item 5                               .758     .841            .772 (–.014)    .772 (.069)
   Item 6                               .742     .669            .731 (.011)     .731 (–.062)
   Item 7                               .695     .629            .687 (.008)     .687 (–.058)
   Item 8                               .648     .656            .649 (–.001)    .649 (.007)
Collective efficacy
   Item 1                               .534     .524            .529 (.005)     .529 (–.005)
   Item 2                               .451     .415            .444 (.007)     .444 (–.029)
   Item 3                               .730     .742            .732 (–.002)    .732 (.010)
   Item 4                               .776     .787            .771 (.005)     .771 (.016)
   Item 5                               .542     .401            .519 (.023)     .519 (–.118)
   Item 6                               .700     .637            .690 (.010)     .690 (–.053)
Work-group cohesiveness
   Item 1                               .749     .662            .736 (.013)     .736 (–.074)
   Item 2                               .821     .799            .818 (.003)     .818 (–.019)
   Item 3                               .862     .842            .855 (.007)     .855 (–.013)
   Item 4                               .846     .851            .847 (–.001)    .847 (.004)
   Item 5                               .865     .870            .866 (–.001)    .866 (.004)
   Item 6                               .808     .819            .810 (–.002)    .810 (.009)
Collective goal commitment
   Item 1                               .785     .816            .789 (–.004)    .789 (.027)
   Item 2                               .646     .700            .652 (–.006)    .652 (.048)
   Item 3                               .639     .538            .624 (.015)     .624 (–.086)
   Item 4                               .752     .694            .744 (.008)     .744 (–.050)
   Item 5                               .695     .691            .694 (.001)     .694 (–.003)
   Item 6                               .735     .732            .736 (–.001)    .736 (–.004)

Note: TL-MLQ = transformational leadership measure. Numbers within parentheses reflect the difference between the freely estimated factor loadings and their respective loadings for the completely constrained model. All factor loadings are significant, p < .001.
a. All parameters were freely estimated.
b. Complete equivalence model; all parameters were constrained to equality.

Table 5
Results of Latent Mean Difference Tests

Measure                           Mean Difference(b)   Standard Error   Critical Ratio

Critical values(a)
 Transformational leadership            0.162               0.031            5.29**
 Collective efficacy                   –0.065               0.023           –2.80**
 Work-group cohesiveness                0.093               0.033            2.84**
 Collective goal commitment             0.083               0.033            2.52*

Critical values(c)
 Transformational leadership            0.162               0.032            5.11**
 Collective efficacy                   –0.073               0.024           –3.00**
 Work-group cohesiveness                0.094               0.035            2.68**
 Collective goal commitment             0.079               0.034            2.30*

a. Mean differences for the complete equivalence model (Phase 6).
b. Paper-and-pencil mode was used as the referent; values reflect mean differences for the Web-based mode.
c. Mean differences for the scalar equivalence model (Phase 3).
*p < .05. **p < .01.


The paper-and-pencil sample included employees who did not have a company e-mail address and was therefore composed primarily of lower level, nonmanagerial employees. As a result, we conducted a series of MGCFAs to evaluate the equivalence of responses across these various groups. To the extent that measurement equivalence holds across country, language, and job-level groups, concerns that sample differences confounded the administration-mode results should be lessened.

The organizational samples varied in size, particularly across countries. Because MGCFA generally requires moderate to large sample sizes, we limited the analyses to samples that exceeded 100 participants. When evaluating cross-national or cross-language measurement equivalence, previous researchers have suggested that equivalence is supported when factor loadings (metric equivalence) are found to be equivalent across groups (Ryan et al., 1999). Therefore, based on the ∆CFI critical value (Cheung & Rensvold, 2002), equal factor loadings across groups were used in the present study as the criterion for cross-national and cross-language equivalence.

We first tested cross-national equivalence using data from 13 countries: Australia (n = 106), Canada (n = 128), Croatia (n = 414), Denmark (n = 148), France (n = 508), Italy (n = 186), Poland (n = 210), Romania (n = 117), Spain (n = 101), Sweden (n = 253), Switzerland (n = 753), the United Kingdom (n = 328), and the United States (n = 682). Results indicated no model degradation (∆CFI = –0.001) when the model with factor loadings constrained to equality was compared to the baseline model of configural equivalence. Next, equivalence comparisons were conducted across 12 language versions of the four measures: Croatian (n = 417), Czech (n = 104), Danish (n = 146), English (n = 1,727), French (n = 602), German (n = 684), Italian (n = 196), Polish (n = 214), Portuguese (n = 102), Romanian (n = 117), Spanish (n = 174), and Swedish (n = 258). In comparing the configural (i.e., baseline) and constrained (i.e., metric equivalence) models, there was no indication of model degradation (∆CFI = –0.002), fully supporting metric equivalence across languages for each of the four measures.

Following the recommendation of Ryan et al. (1999), we then conducted a cross-national, cross-language comparison using the five largest samples: the United States (English language only, n = 672), the United Kingdom (English language only, n = 326), France (French language only, n = 490), Croatia (Croatian language only, n = 409), and Sweden (Swedish language only, n = 250). We chose not to include Switzerland in this analysis because the country does not have one distinct national language; depending on the geographic region, German, Italian, or French is the language most used. Consistent with the previous findings, metric equivalence was fully supported for the measures (∆CFI = –0.002) across the five cross-national, cross-language groups. Finally, because the paper-and-pencil sample was composed primarily of nonmanagerial employees, only four groups were included in the job-level analysis: upper-level managers (Web-based version, n = 500), line managers and team leaders (Web-based version, n = 1,246), nonmanagerial employees (Web-based version, n = 2,498), and nonmanagerial employees (paper-and-pencil version, n = 589). Once again, the measures exhibited metric equivalence across the job-level groups (∆CFI = 0.000). In sum, these findings support measurement equivalence across country, language, and job level.
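The same ∆CFI criterion from the administration-mode comparisons drives each of these confound checks. The compact summary below simply restates the ∆CFI values reported above; the dictionary and loop are our illustration.

```python
# Restating the confound analyses: the same Delta-CFI criterion is applied
# to each alternative grouping of the data. Values are the Delta-CFI
# results reported in the text.
confound_results = {
    "country (13 groups)": -0.001,
    "language (12 groups)": -0.002,
    "country x language (5 groups)": -0.002,
    "job level (4 groups)": 0.000,
}
for grouping, delta_cfi in confound_results.items():
    verdict = ("metric equivalence supported"
               if delta_cfi >= -0.01 else "noninvariant")
    print(f"{grouping}: Delta-CFI = {delta_cfi:+.3f} -> {verdict}")
```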

Discussion

The purpose of the present investigation was to examine, within an international context, measurement equivalence issues related to paper-and-pencil and Web-based versions of Bass and Avolio's (2000) TL-MLQ, the most frequently used measure of transformational leadership. Moreover, as an extension of previous measurement invariance research, we sought to examine whether administration mode affects whether TL-MLQ responses have equivalent relations with three theoretically related workplace constructs (viz., collective efficacy, work-group cohesiveness, and collective goal commitment).

In line with this purpose, our findings address whether the Web-based administration of well-developed paper-and-pencil measures can result in measurement invariance. Because we employed a series of nested statistical tests using MGCFA, we can more confidently conclude that the observed results were valid and not due to measurement artifacts. Consequently, several conclusions are worth noting.

First, the underlying factor structures of our study measures were the same across administration modes. Thus, there was no evidence that paper-and-pencil and Web-based versions of the same measures evoked different conceptual frames of reference. Combined with the finding that factor loadings were equivalent across administration modes, we conclude that the measures tested displayed "strong factorial invariance" (Vandenberg, 2002). Furthermore, respondents completed the measures in one of 17 languages. When working with so many different languages, there is always a concern about translation error (see, e.g., Mullen, 1995, pp. 575-576). Had translation error been influential in the present study, however, configural and metric invariance would have been absent (Cheung & Rensvold, 2002; Little, 1997).

Second, no evidence of unequal item intercepts was found for any of our study measures. The observed scalar invariance suggests that these measures possessed equal intervals and zero points; each measure was therefore isometrically calibrated across administration modes. Third, it has been suggested that the use of different scoring formats and, more recently, varying administration modes might be a source of measurement error (Mullen, 1995; Stanton, 1998). Despite varying dimensionality, length, and scale anchors (i.e., frequency and agreement), there was no indication that the unique error variances associated with the items comprising any of the four study variables differed across modes. Consequently, the present findings add to a growing body of research suggesting that online instruments do not unduly influence the psychometric properties of well-developed paper-and-pencil measures (Gosling et al., 2004; Herrero & Meneses, in press; Potosky & Bobko, 1997).

One important methodological issue was our decision to follow Cheung and Rensvold's (2002) recommendation to use the ∆CFI index to detect a lack of invariance.2 Despite various factors that could be expected to contribute to a lack of invariance, based on the ∆CFI guideline, we concluded that the measures were fully equivalent across administration modes. Stated differently, some might question the ∆CFI's ability to detect a lack of invariance when in fact it exists, or what Vandenberg (2002, pp. 142-147) has labeled the "sensitivity issue." Sensitivity is a notable concern for the ∆CFI given that Cheung and Rensvold's (2002) study examined the performance of goodness-of-fit indices with respect to Type I error but not Type II error. This is an understandable concern, but there were two main reasons for not investigating Type II errors (G. Cheung, personal communication, August 23, 2005). First, there are no commonly accepted measures of effect size for differences in estimated parameters when using covariance structure analysis. A meaningful simulation study of Type II errors is therefore impossible unless it is based on arbitrary differences in one or more estimated parameters, yielding findings that would not generalize to other settings. Second, a major issue associated with Type II errors is the excessive power of the likelihood ratio test used to determine the degree of equivalence; by contrast, there is little concern about inadequate power in applying that test.

Concerning the ∆CFI's sensitivity in the current research, two additional points merit mention. First, consider the reported RMSEA values for each of the measurement equivalence hypotheses. As shown in Table 3 (Phases 1-6), the 90% confidence intervals consistently overlapped, indicating no significant model deterioration despite the increasingly restrictive equality constraints imposed on our measurement model. These results provide additional evidence that the four measures tested were equivalent across administration modes, as likewise suggested by the ∆CFI values. Second, using the Social Sciences Citation Index (http://www.isinet.com/products/citation/wos/) database, we located more than 30 articles that referenced Cheung and Rensvold's (2002) study, a handful of which employed the ∆CFI guideline to detect a lack of measurement equivalence. In several of these articles, researchers reported failing to find support for null hypotheses of measurement equivalence because the ∆CFI exceeded the –0.01 threshold (e.g., Asci, Eklund, Whitehead, Kirazci, & Koca, 2005; French & Tait, 2004; Lievens & Anseel, 2004; Mantzicopoulos, French, & Maller, 2004; Ployhart, Wiechmann, Schmitt, Sacco, & Rogg, 2003). Thus, whereas we acknowledge that it may be premature to conclude definitively that the ∆CFI is the index for establishing measurement equivalence, we have no reason to doubt its sensitivity in detecting a lack of invariance.
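The confidence-interval argument reduces to a simple overlap check, sketched below with the RMSEA intervals from Table 3; the helper function is ours for illustration.

```python
# Overlap check for the RMSEA 90% confidence intervals in Table 3
# (Phases 1-6), offered above as corroborating evidence of equivalence.

def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True when two (lower, upper) intervals share any common region."""
    return a[0] <= b[1] and b[0] <= a[1]

rmsea_ci = {1: (0.038, 0.039), 2: (0.037, 0.038), 3: (0.038, 0.039),
            4: (0.038, 0.040), 5: (0.038, 0.040), 6: (0.038, 0.040)}
baseline = rmsea_ci[1]  # Phase 1 configural model
print(all(intervals_overlap(baseline, rmsea_ci[k]) for k in range(2, 7)))  # True
```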

A final point worth noting is that latent-mean scores for three of our study measures differed significantly across administration modes. With the exception of collective efficacy, computer-based scores were higher than their paper-and-pencil counterparts. When past studies have found such differences, paper-and-pencil measures have usually yielded relatively more positive attitudes (e.g., Eaton & Struthers, 2002; Stanton, 1998). We believe there are three possible explanations for our contradictory finding.

One possible explanation is that the Web-based application prompted respondents to present themselves in a socially desirable manner, thus creating a general response set. Richman, Kiesler, Weisband, and Drasgow (1999) referred to social-desirability distortion as "the tendency by respondents, under some conditions and modes of data-collection, to answer questions in a more socially desirable direction than they would under other conditions or modes of administration" (p. 755). For years, it was believed that the anonymous, impersonal, and nonjudgmental nature of Web-based surveys would reduce desirable responding; past findings, however, are mixed (e.g., Lautenschlager & Flaherty, 1990; Rosenfeld, Booth-Kewley, Edwards, & Thomas, 1996; Whitener & Klein, 1995). In their meta-analysis of the literature, Richman et al. (1999) reported that in certain situations (e.g., expectations about how data were to be used or being told that responses would be kept in a data file), social-desirability distortion was greater in Web-based than in paper-and-pencil surveys. Others have similarly noted that respondents' beliefs concerning the trustworthiness and reliability of Web-based data are increasingly being adversely affected by the publicity given to Internet privacy issues (Simsek & Veiga, 2001; Stanton & Rogelberg, 2001). Simsek and Veiga (2001) suggested, for example, that when electronic invitations to participate in a study are sent to employees' e-mail accounts (as was done in the current study), a likely response is, "If they know my e-mail address, they also know me and how I am responding" (p. 231).

A second possibility as to why the latent means from the Web-based application were generally higher than those from the paper-and-pencil application involves the type of statistical tests used in our data analysis. Whereas earlier studies (e.g., Booth-Kewley et al., 1992; Eaton & Struthers, 2002; King & Miles, 1995; Knapp & Kirk, 2003; Stanton, 1998) have typically employed ANOVA or simple t tests when conducting mean comparisons, we tested for latent-mean differences using mean and covariance structure (commonly referred to as MACS) analysis techniques. As a result, we were able to take into account measurement equivalence as well as measurement error. To explore this alternative more fully, we pooled the responses of both the paper-and-pencil and Web-based samples and tested for mean differences across applications using post hoc one-way ANOVAs. Results indicated that collective efficacy and collective goal commitment were no longer significantly different across administration modes. This finding supports earlier arguments regarding the advantages of covariance analyses; had we not used MACS when testing for latent-mean differences, we would have erroneously concluded that there were no differences between administration modes on two of the four study measures. With the availability of numerous statistical packages offering MGCFA and latent-means analysis options (and texts explaining how to use them), we recommend that researchers no longer use traditional ANOVA or t tests for making such cross-group comparisons.
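The contrast can be seen in the mechanics of the conventional test. A one-way ANOVA, sketched below, compares observed scale scores and so ignores measurement error, unlike the MACS latent-mean tests; the data here are synthetic placeholders, not the study's responses.

```python
# Sketch of the post hoc comparison described above: a conventional
# one-way ANOVA on observed scale scores, which (unlike MACS) does not
# account for measurement error. Data are synthetic placeholders.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(seed=7)
web_scores = rng.normal(loc=3.60, scale=0.80, size=4244)   # hypothetical
paper_scores = rng.normal(loc=3.55, scale=0.80, size=665)  # hypothetical

f_stat, p_value = f_oneway(web_scores, paper_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```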

A third possibility as to why the means from the Web-based application were higher than those from the paper-and-pencil application concerns the nonrandom assignment of respondents across administration modes. The nonmanagerial nature of the paper-and-pencil sample might, for example, account for the observed lower latent means. We acknowledge this potential limitation but note that in a real-world context, as in the present study, organizations typically do not have the luxury of randomly assigning employees across experimental conditions. More typically, organizations strive for the largest and most representative employee sample possible, which may include administering both paper-and-pencil and Web-based surveys. As a further attempt to address this concern, we conducted a post hoc latent-means comparison that contrasted a nonmanagerial subsample (n = 559, selected at random from the Web-based sample) with the paper-and-pencil sample of nonmanagerial employees (n = 559). The results were entirely consistent with the findings reported for our complete samples.

A second study limitation is likewise related to the nonrandom assignment of respondents. More specifically, the sample sizes across respondent groups differed greatly, possibly influencing parameter standard errors, which could bias nested model comparisons. We again note that the purpose of the present study was to investigate measurement equivalence issues using data collected in a real-world context. Because this limitation remains, we conducted post hoc equivalence analyses with identical sample sizes for both administration modes. Supporting the results based on the complete samples, each measure was found to be fully equivalent across the Web-based and paper-and-pencil administrations.3
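The equal-n resampling design described here and in Note 3 reduces to drawing random Web-based subsamples of the paper-and-pencil sample size before rerunning the equivalence tests. The article used SPSS's random case selection; in the sketch below, numpy is our stand-in and the data frame and column are synthetic placeholders.

```python
# Sketch of the equal-n post hoc design: draw a random Web-based subsample
# matching the paper-and-pencil n. SPSS's random case selection was used
# in the article; numpy stands in here, with synthetic placeholder data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

def equal_n_subsample(web: pd.DataFrame, n_paper: int) -> pd.DataFrame:
    """Randomly draw a Web-based subsample of the paper-and-pencil size."""
    idx = rng.choice(web.index.to_numpy(), size=n_paper, replace=False)
    return web.loc[idx]

web_sample = pd.DataFrame({"item_1": rng.integers(1, 6, size=4244)})  # synthetic
print(len(equal_n_subsample(web_sample, n_paper=665)))  # 665, per Note 3
```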

With few exceptions (e.g., Booth-Kewley et al., 1992; Donovan et al., 2000), past studies comparing conventional paper-and-pencil and Web-based applications have used undergraduate student samples or recruited Internet users (e.g., via links placed on Web pages, listservs, search engines, and e-mail snowballing) (e.g., Buchanan & Smith, 1999; Epstein, Klinkenberg, Wiley, & McKinley, 2001; Herrero & Meneses, in press; King & Miles, 1995; Knapp & Kirk, 2003; Potosky & Bobko, 1997). Although these Internet approaches offer certain advantages (e.g., large sample sizes), they also raise concerns about the representativeness of nonrandom samples and the generalizability of the resulting findings to other settings (Kraut et al., 2004). The present study contributes to the measurement literature in that actual employees were sampled as part of an organization-sponsored survey; it thus extends the generalizability of available knowledge to a real-world (workplace) context. We note, however, that respondents in the present study were guaranteed complete anonymity. An alternative survey methodology would have been to assure respondent confidentiality. The difference between these alternatives is that in confidential surveys, respondents can be identified. Consequently, confidential surveys may lead respondents to sanitize their responses for fear of negative evaluations or other forms of reprisal. Additional research is needed, therefore, to determine whether administration mode and degree of anonymity interact to influence measurement equivalence.

Although our primary focus has been the psychometric properties of the TL-MLQ across administration modes, demonstrating that the TL-MLQ and its associated outcomes were equivalent across a wide variety of nations and languages is notable and, we suggest, may also hold interesting implications for future research and practice. As Scandura noted in her letter exchange with Dorfman, "The need to discern how perceptions of leadership may vary from one culture to another and the implications for understanding seem more pressing than ever in the context of leader emergence today" (Scandura & Dorfman, 2004, p. 280). Accordingly, by assessing the TL-MLQ in 13 countries and 12 languages, our results extend international and cross-cultural leadership research (e.g., Den Hartog, House, Hanges, Ruiz-Quintanilla, & Dorfman, 1999), as we have demonstrated that raters of leadership behavior invoked a similar conceptual frame of reference with regard to Bass and Avolio's transformational leadership measure. Furthermore, our results support Bass's (1997) contention that a single transformational leadership theory may account for leadership across an assortment of differing cultures. Interestingly, previous research examining the equivalence of measures has often concluded that constructs are not equivalent across cultures (e.g., Riordan & Vandenberg, 1994; Ryan et al., 1999). Yet the present effort adds to the growing body of research (e.g., the GLOBE Project on Leadership Worldwide; Den Hartog et al., 1999) demonstrating convergence in cross-national perceptions of leadership. Hence, we echo Scandura's entreaty that the next step will be for researchers to investigate why this so often appears to be the case (Scandura & Dorfman, 2004). Moreover, leadership researchers are becoming increasingly interested in how leaders exert their influence on followers' job attitudes and behavior. The analytic approach used in the present study may thus provide researchers with a powerful tool for investigating the relational equivalence between the TL-MLQ and associated outcomes across cultures and languages. Finally, our results suggest that researchers and practitioners interested in transformational leadership assessment, training, and development would benefit from separately assessing lower and higher order transformational leadership constructs. If future research were to find, for example, that the influence of individualized consideration was similar across cultures, such a finding would hold both theoretical and practical implications for multinational corporations in such matters as expatriate training.

There is still much for survey researchers to learn about how to design, interpret, and account for responses obtained using multiple modes of administration. The current study was conducted to provide a better understanding of measurement equivalence when "computerizing" a well-developed paper-and-pencil measure. It is unique in that survey data were collected from respondents in a multinational organization that employed both paper-and-pencil and Web-based surveys to ensure a sample representative of the organization as a whole. Moreover, it extends prior measurement equivalence research on Web-based and paper-and-pencil surveys by exploring the practical question of whether mode of administration affects the strength of the relations among job attitudes and, consequently, how survey results are interpreted by an organization. In sum, our study demonstrates that MGCFA is a useful tool for testing measurement equivalence across research samples and administration modes. Few researchers have explored differences between Web-administered and traditional paper-and-pencil responses collected from actual employees voluntarily participating in an organization-sponsored field study. Although our research incorporated only four measures, our results are consistent with past investigations that have reported few differences between participants' response patterns (e.g., Donovan et al., 2000; Stanton, 1998). Whereas caution is always advised when using multiple modes of administration, our results suggest that, in general, there are minimal measurement differences for well-developed, psychometrically sound instruments, whether administered in a paper-and-pencil or an online format. In particular, our findings open a methodological door for survey researchers wishing to assess transformational leadership with a Web-based platform, as well as to compare or combine responses collected with paper-and-pencil and Web-based applications.

Notes

1. Alternative options for dealing with the offending estimate include removing items (Lawley & Maxwell, 1971) and setting the residual variance to zero or some other small value (Chen et al., 2001). We chose not to use these alternatives for several reasons. We believed deletion of items to be inappropriate given that we would have needed to delete items in both samples, even though the offending estimate was found only in the Web-based sample. The second alternative, constraining the negative variance to zero, was also deemed unsuitable because it imposes the additional constraint only on the offending estimate in the Web-based sample; the baseline models used in subsequent equivalence tests would thus have been slightly different. Furthermore, the CRDIFF analysis indicated there were instances in both samples involving equal residual variances that, as Byrne (2001) has suggested, are prime candidates for equality constraints. Despite our application of Byrne's recommendation, the equivalence tests were also conducted using both alternatives, and the results were nearly identical. These results can be obtained by contacting the first author.


2. The ensuing discussion of the ∆CFI's sensitivity was primarily guided by an anonymous referee's comments.

3. The paper-and-pencil sample consisted of the 665 respondents used in the previous analyses. Using the select-cases-at-random function in SPSS, 5 random samples of 665 respondents were drawn from the 4,244 employees who completed the measures using the Web-based survey, and equivalence analyses were independently conducted across the paper-and-pencil sample and each of the five Web-based subsamples. In each of the equivalence tests, there was no indication of model degradation when additional equality constraints were added to the baseline models.

References

American Psychological Association. (2002). Ethical principles of psychologists and code of conduct. American Psychologist, 57, 1060-1073.
Antonakis, J., Avolio, B. J., & Sivasubramaniam, N. (2003). Context and leadership: An examination of the nine-factor full-range leadership theory using the Multifactor Leadership Questionnaire. Leadership Quarterly, 14, 261-295.
Arbuckle, J. L. (1996). Full information estimation in the presence of missing data. In G. A. Marcoulides & R. E. Schumaker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 243-277). Mahwah, NJ: Erlbaum.
Arbuckle, J. L., & Wothke, W. (1999). Amos 4.0 user's guide. Chicago: SmallWaters.
Asci, F. H., Eklund, R. C., Whitehead, J. R., Kirazci, S., & Koca, C. (2005). Use of the CY-PSPP in other cultures: A preliminary investigation of its factorial validity for Turkish children and youth. Psychology of Sport and Exercise, 6, 33-50.
Avolio, B. J., Bass, B. M., & Jung, D. I. (1999). Re-examining the components of transformational and transactional leadership using the Multifactor Leadership Questionnaire. Journal of Occupational and Organizational Psychology, 72, 441-462.
Bass, B. M. (1985). Leadership and performance beyond expectations. New York: Free Press.
Bass, B. M. (1997). Does the transactional-transformational leadership paradigm transcend organizational and national boundaries? American Psychologist, 52, 130-139.
Bass, B. M., & Avolio, B. J. (2000). Multifactor Leadership Questionnaire: Technical report, leader form, rater form, and scoring key for MLQ form 5x-short (2nd ed.). Redwood City, CA: Mindgarden.
Booth-Kewley, S., Edwards, J. E., & Rosenfeld, P. (1992). Impression management, social desirability, and computer administration of attitude questionnaires: Does the computer make the difference? Journal of Applied Psychology, 77, 562-566.
Brannick, M. T. (1995). Critical comments on applying covariance structure modeling. Journal of Organizational Behavior, 16, 201-213.
Buchanan, T. (2002). Online assessment: Desirable or dangerous? Professional Psychology: Research and Practice, 33, 148-154.
Buchanan, T., Johnson, J. A., & Goldberg, L. R. (2005). Implementing a five-factor personality inventory for use on the Internet. European Journal of Psychological Assessment, 21, 115-127.
Buchanan, T., & Smith, J. L. (1999). Using the Internet for psychological research: Personality testing on the World Wide Web. British Journal of Psychology, 90, 125-144.
Bycio, P., Hackett, R. D., & Allen, J. S. (1995). Further assessments of Bass's (1985) conceptualization of transactional and transformational leadership. Journal of Applied Psychology, 80, 468-478.
Byrne, B. M. (2001). Structural equation modeling with AMOS: Basic concepts, applications, and programming. Mahwah, NJ: Erlbaum.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.
Carless, S. A. (1998). Assessing the discriminant validity of transformational leadership behavior as measured by the MLQ. Journal of Occupational and Organizational Psychology, 71, 353-358.
Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234-246.
Chen, F., Bollen, K. A., Paxton, P., Curran, P. J., & Kirby, J. B. (2001). Improper solutions in structural equation models: Causes, consequences, and strategies. Sociological Methods and Research, 29, 468-508.
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1-27.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233-255.
Cohen, R. J., Swerdlik, M. E., & Phillips, S. M. (1996). Psychological testing and assessment (3rd ed.). Mountain View, CA: Mayfield.
Den Hartog, D. N., House, R. J., Hanges, P. J., Ruiz-Quintanilla, S. A., & Dorfman, P. W. (1999). Culture specific and cross-culturally generalizable implicit leadership theories: Are attributes of charisma/transformational leadership universally endorsed? Leadership Quarterly, 10, 219-256.
Den Hartog, D. N., Van Muijen, J. J., & Koopman, P. L. (1997). Transactional versus transformational leadership: An analysis of the MLQ. Journal of Occupational and Organizational Psychology, 70, 19-34.
Dillon, W. R., Kumar, A., & Mulani, N. (1987). Offending estimates in covariance structure analysis: Comments on the causes of and solutions to Heywood cases. Psychological Bulletin, 101, 126-135.
Donovan, M. A., Drasgow, F., & Probst, T. M. (2000). Does computerizing paper-and-pencil job attitude scales make a difference? New IRT analyses offer insight. Journal of Applied Psychology, 85, 305-313.
Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with external variables are the central issue. Psychological Bulletin, 95, 134-135.
Eaton, J., & Struthers, C. W. (2002). Using the Internet for organizational research: A study of cynicism in the workplace. CyberPsychology and Behavior, 5, 305-313.
Epstein, J., Klinkenberg, W. D., Wiley, D., & McKinley, L. (2001). Insuring sample equivalence across Internet and paper-and-pencil assessments. Computers in Human Behavior, 17, 339-346.
Evan, W. M., & Miller, J. R. (1969). Differential effects on response bias of computer vs. conventional administration of a social science questionnaire: An exploratory methodological experiment. Behavioral Science, 14, 216-227.
Fouladi, R. T., McCarthy, C. J., & Moller, N. P. (2002). Paper-and-pencil or online? Evaluating mode effects on measures of emotional functioning and attachment. Assessment, 9, 204-215.
French, D. J., & Tait, R. J. (2004). Measurement invariance in the General Health Questionnaire-12 in young Australian adolescents. European Child and Adolescent Psychiatry, 13, 1-7.
Gosling, S. D., Vazire, S., Srivastava, S., & John, O. P. (2004). Should we trust Web-based studies? A comparative analysis of six preconceptions about Internet questionnaires. American Psychologist, 59, 93-104.
Herrero, J., & Meneses, J. (in press). Short Web-based versions of the Perceived Stress (PSS) and Center for Epidemiological Studies-Depression (CESD) scales: A comparison to pencil-and-paper responses among Internet users. Computers in Human Behavior.
Hogg, A. (2002). Conducting online research. Burke Incorporated White Paper Series, 3(2). Retrieved July 8, 2004, from http://www.burke.com/whitepapers/PDF/B.WhitePaperVol3-2002-Iss2.pdf
Hollenbeck, J. R., Klein, H. J., O'Leary, A. M., & Wright, P. M. (1989). Investigation of the construct validity of a self-report measure of goal commitment. Journal of Applied Psychology, 74, 951-956.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424-453.
King, W. C., Jr., & Miles, E. W. (1995). A quasi-experimental assessment of the effect of computerizing noncognitive paper-and-pencil measurements: A test of measurement equivalence. Journal of Applied Psychology, 80, 643-651.
Knapp, H., & Kirk, S. A. (2003). Using pencil and paper, Internet, and touch-tone phones for self-administered surveys: Does methodology matter? Computers in Human Behavior, 19, 117-134.
Kraut, R., Olson, J., Banaji, M., Bruckman, A., Cohen, J., & Couper, M. (2004). Psychological research online: Report of Board of Scientific Affairs' Advisory Group on the Conduct of Research on the Internet. American Psychologist, 59, 105-117.
Lautenschlager, G. J., & Flaherty, V. L. (1990). Computer administration of questions: More desirable or more social desirability? Journal of Applied Psychology, 75, 310-314.
Lawley, D. N., & Maxwell, A. E. (1971). Factor analysis as a statistical method (2nd ed.). London: Butterworth.
Lievens, F., & Anseel, F. (2004). Confirmatory factor analysis and invariance of an organizational citizenship behavior measure across samples in a Dutch-speaking context. Journal of Occupational and Organizational Psychology, 77, 299-306.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53-76.
Lowe, K. B., Kroeck, K. G., & Sivasubramaniam, N. (1996). Effectiveness correlates of transformational and transactional leadership: A meta-analytic review of the MLQ literature. Leadership Quarterly, 7, 385-425.
Mantzicopoulos, P., French, B. F., & Maller, S. J. (2004). Factor structure of the Pictorial Scale of Perceived Competence and Social Acceptance with two pre-elementary samples. Child Development, 75, 1214-1228.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103, 391-410.
Meade, A. W., & Lautenschlager, G. J. (2004a). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7, 361-388.
Meade, A. W., & Lautenschlager, G. J. (2004b). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11, 60-72.
Millsap, R. E., & Kwok, O. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93-115.
Mullen, M. R. (1995). Diagnosing measurement equivalence in cross-national research. Journal of International Business Studies, 26, 573-596.
Naglieri, J. A., Drasgow, F., Schmit, M., et al. (2004). Psychological testing on the Internet: New problems, old issues. American Psychologist, 59, 150-162.
Ployhart, R. E., & Oswald, F. L. (2004). Applications of mean and covariance structure analysis: Integrating correlational and experimental approaches. Organizational Research Methods, 7, 27-65.
Ployhart, R. E., Wiechmann, D., Schmitt, N., Sacco, J. M., & Rogg, K. (2003). The cross-cultural equivalence of job performance ratings. Human Performance, 16, 49-79.
Potosky, D., & Bobko, P. (1997). Computer versus paper-and-pencil administration mode and response distortion in noncognitive selection tests. Journal of Applied Psychology, 82, 293-299.
Richman, W. L., Kiesler, S., Weisband, S., & Drasgow, F. (1999). A meta-analytic study of social desirability distortion in computer-administered questionnaires, traditional questionnaires, and interviews. Journal of Applied Psychology, 84, 754-775.
Riggs, M. L., & Knight, P. A. (1994). The impact of perceived group success-failure on motivational beliefs and attitudes: A causal study. Journal of Applied Psychology, 79, 755-766.
Riordan, C. M., & Vandenberg, R. J. (1994). A central question in cross-cultural research: Do employees of different cultures interpret work-related measures in an equivalent manner? Journal of Management, 20, 643-671.
Riordan, C. M., & Weatherly, E. W. (1999). Defining and measuring employees' identification with their work groups. Educational and Psychological Measurement, 59, 310-324.
Rosenfeld, P., Booth-Kewley, S., Edwards, J. E., & Thomas, M. D. (1996). Responses on computer surveys: Impression management, social desirability, and the Big Brother syndrome. Computers in Human Behavior, 12, 263-274.
Ryan, A. M., Chan, D., Ployhart, R. E., & Slade, L. A. (1999). Employee attitude surveys in a multinational organization: Considering language and culture in assessing measurement equivalence. Personnel Psychology, 52, 37-58.
Scandura, T., & Dorfman, P. (2004). Leadership research in an international and cross-cultural context. Leadership Quarterly, 15, 277-307.
Simsek, Z., & Veiga, J. F. (2001). A primer on Internet organizational surveys. Organizational Research Methods, 4, 218-235.
Stanton, J. M. (1998). An empirical assessment of data collection using the Internet. Personnel Psychology, 51, 709-725.
Stanton, J. M., & Rogelberg, S. G. (2001). Using Internet/Intranet Web pages to collect organizational research data. Organizational Research Methods, 4, 200-217.
Steenkamp, J. B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78-90.
Synodinos, N. E., & Brennan, J. M. (1990). Evaluating microcomputer interactive survey software. Journal of Business and Psychology, 4, 483-493.
Thompson, L. F., Surface, E. A., Martin, D. L., & Sanders, M. G. (2003). From paper to pixels: Moving personnel surveys to the Web. Personnel Psychology, 56, 197-227.
Vandenberg, R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5, 139-158.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4-70.
Whitener, E. M., & Klein, H. J. (1995). Equivalence of computerized and traditional research methods: The roles of scanning, social environment, and social desirability. Computers in Human Behavior, 11, 65-75.
Wothke, W. (2000). Longitudinal and multigroup modeling with missing data. In T. D. Little, K. U. Schnabel, & J. Baumert (Eds.), Modeling longitudinal and multilevel data (pp. 219-240). Mahwah, NJ: Erlbaum.
Zaccaro, S. J., Rittman, A. L., & Marks, M. A. (2001). Team leadership. Leadership Quarterly, 12, 451-483.

Michael S. Cole is a senior research fellow and lecturer in the Institute for Leadership and Human Resource Management at the University of St. Gallen in Switzerland. His research interests include exploring how organizational contextual factors influence employees' attachments to organizations.

Arthur G. Bedeian is a Boyd Professor at Louisiana State University. He is a former editor of the Journal of Management, a past president of the Academy of Management, and a quondam dean of the Academy's Fellows Group.

Hubert S. Feild is Torchmark Professor of Management in the College of Business at Auburn University. His professional interests include human resource selection and research methods in human resource management.
