THE RELIABILITY OF MEASURING

ORGANIZATIONAL MATURITY

KHALED EL EMAM

NAZIM H. MADHAVJI

To appear in Software Process Improvement and Practice, John Wiley & Sons, 1995.

THE RELIABILITY OF MEASURING

ORGANIZATIONAL MATURITY✧

KHALED EL EMAM†

NAZIM H. MADHAVJI*†

†CENTRE DE RECHERCHE INFORMATIQUE DE MONTREAL (CRIM)

*SCHOOL OF COMPUTER SCIENCE, MCGILL UNIVERSITY

Abstract

One of the recently developed classes of decision making tools for software engineering

management is the organizational maturity assessment. The scores from such assessments are being

applied in focusing and tracking self-improvement efforts, and as part of the contract award decision

making process. However, until now, the important issue of the reliability of assessments has rarely

been addressed by the developers and users of assessment methods. The extent of reliability

describes the degree to which assessment scores are consistent and repeatable. In this paper we

present some basic techniques for the estimation of the reliability of maturity assessments. We then

demonstrate through a case study some of these techniques and how to apply the reliability

estimate(s) in decision making situations. Examples of decisions are: comparing organizations'

maturity scores and evaluating the relationship between maturity scores and some criterion of

effectiveness. For the latter, we found a weak relationship between maturity and our measure of

effectiveness, namely the success of the requirements engineering process.

Keywords: software process assessment, organizational maturity, reliability, empirical study.

✧ This work was supported, in part, by the IT Macroscope Project and NSERC Canada.

1 Introduction

1.1 Use of Maturity Assessments and Assessment Scores

In recent years there has been a marked increase in the number of methods for assessing the

software process maturity and software process capability of organizations¹ developing software

and/or developing software-based systems. The basic premise behind these methods is that

assessment scores are positively associated with project and organizational effectiveness (e.g.,

productivity, software quality, user satisfaction, etc.).

Most prominent amongst the assessment methods is the Software Capability

Evaluation (SCE) (SEI 1994a; 1994b) based on the Capability Maturity Model (CMM) for

software developed by the SEI (Paulk et al. 1993a; 1993b). Other methods and models exist², for

example, Trillium (Bell Canada 1992; Coallier 1995), Bootstrap (Koch 1993; Kuvaja et al. 1994),

the current effort on the SPICE project (Dorling 1993; Drouin 1994a; 1994b), and variants of the

CMM based methods (Drew 1992; Thompson et al. 1992). Furthermore, some assessment

methods are becoming de facto standards, such as the CMM based methods developed at the

SEI, while others are intended to be international standards, such as those based on the SPICE

framework (Konrad 1994; Paulk and Konrad 1994b).

The contexts within which assessment methods have been applied include: (a) self-

assessments, and (b) maturity determination. Self-assessments are voluntary and their intended

purpose is for an organization to improve its own maturity. For example, in Motorola, maturity

self-assessments are employed to motivate and evaluate process improvement

progress (Daskalantonakis 1994). Maturity determination is commonly performed by one

organization to evaluate the maturity of its suppliers on an on-going or contract-award basis. For

example, the results of such assessments reportedly have significant weights in some contract

award decisions made by the U.S. Navy (Rugg 1993), and minimum maturity scores are

expected to be a requirement for some U.S. Air Force contracts (Saiedian and Kuzara 1995).

1 As pointed out in (Paulk and Konrad 1994a), there is a difference between a maturity assessment and a process capability assessment. The former produces an overall software process maturity score(s) for an organization, while the latter produces score(s) for the implementation and institutionalization of specific process(es) in an organization, i.e., process measure(s) rather than organization measure(s). In this paper we focus more on organizational maturity. However, this does not result in any loss of generality in the presentation.

2 Many of these models are also partially based on the CMM for software developed by the SEI.

Important decisions are made by organizations based on assessment scores. For example, a

self-assessment that identifies strengths and weaknesses in an organization might lead the

organization to focus future improvement efforts and resources on rectifying the weaknesses.

The consequences of an erroneous interpretation of scores in a self-assessment situation could

be mis-allocation or inefficient allocation of resources. Also, a supplier who has been determined

to have a relatively low maturity score is less likely to be awarded a contract. The consequences

of an erroneous interpretation of scores in a contract-award situation could be the losers

contesting awards and starting costly litigation.

Some authors have discouraged an over-reliance on obtained assessment scores

(Besselman et al. 1993; Humphrey and Curtis 1991). In practice, however, characterizing

organizations quantitatively using such a score gives a sense of objectivity to decision making

and legitimizes actions taken. This indeed makes maturity scores an attractive proposition. For

example, at Motorola, Daskalantonakis (1994) states "Low scores identify key activities and key

process areas that need immediate attention to raise the organization's software process

capability". Haas et al. (1994) describe how maturity scores can be used to identify

organizational weaknesses with respect to ISO 9001 certification. In a contract award case,

Rugg (1993) states that "software capability - as measured in the submitted proposal and on site

- counted as one third of the weight for consideration to award the contract". Also, a letter from

the U.S. Department of the Air Force (dated September 1991) stated (Saiedian and Kuzara

1995) “We wish to point out that at some point in the near future, all potential software

developers will be required to demonstrate a software maturity Level 3 [on the CMM] before they

can compete in ESD/RL [...] major software development initiatives”.

1.2 Reliability of Assessments

Given the importance of the decisions made by organizations based on assessment scores, two

questions that need to be asked are: “how reliable are such assessments?” and “what are the

implications of reliability for interpreting assessment scores?”. Reliability is defined as the extent

to which the same measurement procedure will yield the same results on repeated

trials (Carmines and Zeller 1979).

Recent literature and practice have reflected a concern with the reliability of assessments. For

example, Card discusses the reliability of SCEs in a recent article (Card 1992), where he

commented on the inconsistencies of the results obtained from assessments of the same

organization by different teams. Mention is also made of reliability in a contract award situation

where emphasis is placed on having one team assess different contractors to ensure

consistency (Rugg 1993). Bollinger and McGowan (1991) criticize the extent to which the scoring

scheme used in SCEs contributes towards reduced reliability. The Interim Profile method of the

SEI (Whitney et al. 1994) includes specific indicators to evaluate reliability. Furthermore, a deep

concern with reliability is reflected in the empirical trials of the prospective SPICE standard

whereby the evaluation of the reliability of SPICE-conformant assessments is an important focus

of study (El Emam and Goldenson 1995).

Among all the maturity assessment methods in the published literature, there is only one mention

of an actual reliability study, made by Humphrey and Curtis (1991). In that article they briefly

describe a study of level 2 and 3 questions on the preliminary version of the SEI maturity

questionnaire where the reliability (as estimated by an internal consistency method which will be

described later in this article) was found to be very high (0.9). They, however, omit the details of

the study. Furthermore, use of that reliability estimate in decision making is not standard

procedure, and, to our knowledge, is rarely ever done in practice.

The extent of unreliability has at least two important implications. First, the score obtained

from an assessment is only one of the many possible scores that would be obtained had the

organization been repeatedly assessed. This means that, for a given level of confidence that one

is willing to tolerate, an assessment score has a specific probability of falling within a range of

scores. The size of this range increases as reliability decreases. Second, when testing the

hypothesis that maturity is positively associated with the performance of organizations,

unreliability tends to attenuate the magnitude of the relationship. This means that the

relationship, which could be described using a correlation coefficient, is artificially reduced due to

unreliability. Both of the above implications ought to be seriously considered when presenting

and comparing assessment scores, and when empirically investigating relationships between

assessment scores and some criterion.

The reliability of maturity assessment methods can be estimated. Estimates of reliability would

allow one to determine the maturity score range for a given confidence level, and would also

allow one to make corrections for attenuation in correlation coefficients (Nunnally 1978). It is

therefore critical that estimates of the reliability of maturity assessment procedures be

determined and applied in decision making.
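
As a minimal sketch of these two uses, the following Python fragment applies the standard classical test theory formulas: the standard error of measurement, SD * sqrt(1 - reliability), and the correction for attenuation, r_xy / sqrt(r_xx * r_yy). The function names and all numerical values are hypothetical illustrations and are not results from our case study.

```python
import math

def score_band(observed, sd, reliability, z=1.0):
    """Return (low, high): the observed score +/- z standard errors of measurement."""
    sem = sd * math.sqrt(1.0 - reliability)
    return observed - z * sem, observed + z * sem

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the correlation between true scores from an observed correlation."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values, chosen only for illustration.
low, high = score_band(observed=3.0, sd=1.0, reliability=0.84, z=1.96)
print(round(low, 3), round(high, 3))
print(round(correct_for_attenuation(0.30, 0.84, 0.75), 3))
```

Note that the band widens as the reliability argument decreases, and that the corrected correlation is never smaller in magnitude than the observed one.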

The objectives of this paper are to show how to estimate the reliability of maturity

assessments, and how to apply the estimate in decision making (e.g., comparing maturity scores

from two organizations and determining the relationship between maturity scores and criterion

measures of effectiveness). We demonstrate reliability estimation and its application through a

case study.

Section 2 presents a review of some basic reliability concepts. The research method for the

case study application is presented in section 3. In section 4, we demonstrate how to estimate

reliability and how to apply the estimate in decision making. Section 5 concludes the paper with a

discussion of the case study application results, and their implications for future research.

2 The Reliability of Measurement

In this section we introduce the background and some concepts regarding the reliability of

measurement. The intention here is only to place the remainder of the paper in context, and

therefore the discussion is admittedly brief. Further details may be found in the texts of Nunnally

(1978) and Lord and Novick (1968).

2.1 Overview

Much theoretical and analytical work related to measurement and the reliability of measurement

has been done in the fields of psychometrics and educational testing and measurement. This

work is subsumed under the heading of test theory. From a historical perspective, what is

considered to be the first full work on test theory is the text of Thorndike (1904). Since then, a

large body of work has expanded, refined, and added to his original theory. Part of this body of

knowledge is known collectively as classical test theory.

Reliability is considered by many psychometricians to be the fundamental problem in

measurement (Ghiselli et al. 1981). The reliability of measurement is concerned with random

error or nonsystematic influences on measurement operations. When we talk about the reliability

of measurement, we refer to the precision with which a particular attribute of a concept, object or

phenomenon is being measured. Defining and estimating reliability is important because it is

common that repeated measurement of the same attribute of an object, concept, or phenomenon

will not yield exactly the same quantitative values. If the magnitude of the underlying attribute

does not change across repeated measurements, then the fluctuations in the measured values

are considered to be measurement error.

For example, let's say we wish to measure the length of an object using a ruler, and this

measurement operation is repeated over and over again. If the markings on the ruler are

sufficiently close together so that the measurements are obtained to the nearest hundredth of an

inch, then the repeated measurement operations will likely yield several different lengths. This

kind of measurement is not perfectly reliable because of the differences in the lengths obtained

(assuming that the length of the measured object did not change).

2.2 Reliability vs. Validity

The reliability of measurement is different from the validity of measurement. Reliability is

concerned with the extent to which measurement is repeatable and consistent. Validity is

concerned with the extent to which a measurement operation is measuring what it purports to

measure.

One could, for example, seek to measure intelligence by having children throw stones as far

as they could. The distance the stones were thrown on one occasion might correlate highly with

how far they are thrown on another occasion. Thus, being repeatable, the measure would be

highly reliable. However, the distance that stones are thrown would not be considered by many

observers to be a valid measure of intelligence.

The amount of measurement error places a limit on the amount of validity that a measurement

operation can have. But, even in the complete absence of measurement error, there is no

guarantee of validity. Reliability is a necessary but insufficient condition for validity.

2.3 Basic Concepts

A basic concept for comprehending the reliability of measurement is that of a construct. A

construct refers to a meaningful conceptual object. A construct is neither directly measurable nor

observable. However, the quantity or value of a construct is presumed to cause a set of

observations to take on a certain value. An observation can be considered as a question in a

maturity questionnaire (this is also referred to as an item). Thus, the construct can be indirectly

measured by considering the values of those items.

For example, organizational maturity is a construct. Thus, the value of an item measuring “the

extent to which projects follow a written organizational policy for managing system requirements

allocated to software” is presumed to be caused by the true value of organizational maturity.

Also, the value of an item measuring “the extent to which projects follow a written organizational

policy for planning software projects” is presumed to be caused by the true value of

organizational maturity³. Such a relationship is depicted in the path diagram in Figure 1. Since

organizational maturity is not directly measurable, the above two items are intended to estimate

the actual magnitude or true score of organizational maturity.

Since reliability is concerned with random measurement error, error must be considered in

any theory of reliability. The classic theory states that an observed score consists of two

components, a true score and an error score: X = T + E. Thus, X is the score obtained in a

maturity assessment, T is the mean of the theoretical distribution of X scores that would be found

in repeated assessments of the same organization using the same maturity assessment

procedure, and E is the error component.

The true score is considered to be a perfect measure of maturity. In practice, however, the

true score can never be really known since it is generally not possible to obtain a large number of

3 The two items used in the example are based on the SEI's Maturity Questionnaire (Zubrow et al. 1994).

***** Insert Figure 1 around here *****

repeated assessments of the same organization⁴,⁵. True scores are therefore only hypothetical

quantities, but useful nevertheless.

Measurement error is the difference between the observed score and the true score. It is a

property of the measurement operation and not of the organization's maturity. Considering that

the errors are random, then the observed scores obtained from repeated measurements are

sometimes higher and sometimes lower than the true score. Therefore, the error scores are

positive as frequently as they are negative. This means that, in the long run, the mean error is

zero.

The reliability of measurement is defined as the ratio of true score variance to observed score

variance. A reliability coefficient has a value between 0 (perfect unreliability) and 1 (perfect

reliability). Thus, as the error variance increases, reliability approaches 0; and as the error

variance approaches 0, reliability approaches 1.
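
The definition can be illustrated with a small simulation. The sketch below, which is ours and purely illustrative, generates hypothetical true scores T and random errors E, forms X = T + E, and computes the ratio of true score variance to observed score variance. The normal distributions, the sample size, and the function name are assumptions made only for the illustration.

```python
import random
import statistics

def simulate_reliability(n_orgs=5000, true_sd=1.0, error_sd=0.5, seed=1):
    """Simulate X = T + E and return var(T) / var(X)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n_orgs)]
    # Each observed score is the true score plus an independent random error.
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)

# Theoretical value for these settings: 1.0 / (1.0 + 0.25) = 0.8
print(round(simulate_reliability(), 2))
# With a larger error variance the ratio moves toward 0.
print(round(simulate_reliability(error_sd=2.0), 2))
```

In a real assessment the true scores are not observable, so this ratio cannot be computed directly; it has to be estimated.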

There are a number of methods for estimating the reliability of measurement procedures that

are based on the above measurement model. In Appendix A of this article, we describe these

methods, as well as provide directions on how to conduct reliability studies in general.

3 Research Method

To demonstrate reliability estimation and the application of such estimates, the conduct and

results of an example case study application are presented. In this section, we describe the

research method of the case study application. This description includes how we developed the

maturity assessment instrument, the definition of the target population, the sampling procedure

that was followed, and an evaluation of biases due to non-response (since we are not gathering

data from the whole population, we have to ensure that our sample is representative).

4 The true score as defined here is not a Platonic true score in the sense that it represents the "true" maturity of an organization.

5 If one is willing to make some assumptions (e.g., an assumption of linearity), point estimates of true scores can be computed from obtained scores (Lord and Novick 1968).

3.1 Study Background And Context

The case study application described here was performed within the context of an Information

Systems (IS) consultancy firm based in North America with clients worldwide (henceforth referred

to as Company X). The purpose of the study was to construct a reliable instrument for providing

a general measure of IS organizational maturity, and to use the instrument for assessing the

maturity of this firm's clients. This instrument would also be used as part of the software process

diagnosis and improvement services that Company X provides. The instrument could be used by

senior consultants who are knowledgeable about the particular clients. This knowledge would be

gained mainly through their participation in clients' projects. The main constraint on this

instrument development effort was that the instrument had to be short. The context of its use and

the fact that senior consultants were to provide the responses dictated the above constraint.

The domain of analysis for this study was business information systems that are fully

customized for individual user organizations. The unit of analysis was the IS department or IS

function in an overall organization. An IS department develops, maintains, and/or acquires

business information systems. In the remainder of this text the IS department will be referred to

as an organization.

3.2 Instrument Development

The first activity of instrument development was to review the existing literature on IS

organizational maturity. The main sources of information for the instrument developed here were

the work of the SEI on the CMM (Humphrey 1988; 1989), other contemporary maturity models

such as TRILLIUM (Bell Canada 1992), and the much earlier work of Nolan on defining a

maturity model for IS organizations (Nolan 1973).

Based on this review, an initial set of criteria for assessing organizational maturity were

formulated. Subsequently, 30 senior consultants were interviewed to solicit their comments on

the correctness and completeness of these criteria as general measures of organizational

maturity. The characteristics of these consultants were as follows: 70% had project management

backgrounds, 45% had technical backgrounds, and 33% had research and education

backgrounds⁶. Also, 91% of the interviewees were located in Canada, and 9% were located in

the USA. This distribution was dictated strongly by resource constraints.

A set of documents were also inspected. These documents were produced by the particular

consultants interviewed to aid them in their practice. The documents constituted auditing and

assessment questionnaires and frameworks. Some of these were 'homegrown', while others were

based on some of the published assessment methods (e.g., the SEI's SCE method). The

document inspections gave us indications, beyond what is in the literature, on the format and

content of maturity assessment instruments used in IS organizations, wording of questions, and

practices that are considered important.

As a consequence of the interviews and document inspections, the initial set of criteria were

refined into an initial organizational maturity instrument. A semantic differential scale (see

Osgood et al. 1967) was used for all the items in this instrument. A small pilot study was then

conducted to identify ambiguities, inconsistencies, bad wording and to generally get feedback on

the usability and clarity of the instrument. For this, two senior consultants from Company X were

requested to score an organization in an interview setting with one of the authors of this paper

present. Each interviewee was requested to talk out loud while scoring, indicating what he

interpreted each question to mean and the rationale for his scoring. Also, two other researchers

highly familiar with process assessments from the Software Engineering Laboratory at McGill

University reviewed the instrument and noted problems with it. Based on this pilot study, the

initial organizational maturity instrument was refined again. An abridged copy of the final version

of the instrument that is relevant for this paper is included in Appendix B.

For the purposes of this paper, we will focus on four specific dimensions of organizational

maturity that are measured by this instrument. The four dimensions are as follows: (a)

standardization, which is concerned with process and product standardization in the

organization, (b) project management, which is concerned with the extent to which good project

management practices are employed, (c) tools, which is concerned with effective automated tool

6 It should be noted that an interviewee may be characterized as having an intersection of backgrounds, and hence the total does not add up to 100%. For example, some interviewees were initially lead architects (technical background) and were subsequently involved in project management activities, or some interviewees are conducting research or acting as course instructors on primarily technical issues (for instance data modeling techniques).

usage in the organization, and (d) organization, which is concerned mainly with the alignment of

the IS organization with the overall business.

3.3 Sampling Procedure

For this instrument, the target population is the 200 organizations worldwide⁷ that have licensed

the systems development methodology developed and marketed by Company X. Organizations

cannot assess themselves, therefore a set of individuals had to be identified that can perform the

instrument ratings. The first question that needed to be answered was whether employees of the

organizations or employees of Company X should score the instrument. The latter case would

constitute an external assessor. To answer this question, a small pilot study was conducted. For

this pilot study, each of three client organizations was visited and either one or two of their

employees were interviewed. Amongst the questions asked were those related to the

organization's maturity. These interviews indicated that the client personnel were likely to give

the "right" answers, and hence seemed to provide biased judgements. Therefore, only Company

X employees were considered for administering the maturity questionnaire.

The sample frame for our study was the client organizations that had good consulting

relationships with Company X (and hence there existed a consultant with good knowledge of the

organization's practices). In an attempt to construct a stratified sample, the only population

characteristics that were available to us were Gross Revenue (of Company X) by region and by

industrial sector. However, such information was deemed to be highly misleading since a sizable

percentage of Gross Revenue is obtained from a small number of client organizations, and

hence would not appropriately characterize the population.

An initial list of senior consultants that work for Company X was thus formulated. It was

considered by senior management and researchers of Company X that the consultants whose

names are on the list were involved with a highly representative cross-section of their clientele

(they covered all business regions in Canada and outside Canada and all organization sizes that

Company X did business with). Systematically, consultants were selected from the list and

contacted (by telephone, face-to-face, or by electronic mail) and requested to participate in the

study. Some consultants refused to participate. The non-refusals constituted our sample. All

7 Since our study spanned more than one calendar year, this figure is only approximate and represents information we obtained from Company X's Annual Report at the time sampling was initiated.

consultants in the sample were highly familiar with the assessed organizations through their

consulting assignments.

3.4 Response Rate And Non-Response Bias

In total, 86 questionnaires were sent out to the senior consultants. Of the 86 questionnaires, 7%

were sent to consultants who had since left the company. Of the remaining 80 questionnaires,

we received 42 responses (including late respondents). This gives a response rate of 52.5%. Of

all the responses, 4 were unusable due to extensive missing data, leaving a total of 38 usable

responses. This sample is considered to represent 19% of the target population.
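
The arithmetic behind these figures can be retraced as follows. This is only a worked check of the numbers reported above; the variable names are ours, and rounding 7% of 86 to a whole number of questionnaires is an assumption.

```python
sent = 86
left_company = round(0.07 * sent)   # 7% of 86, rounded to whole questionnaires
eligible = sent - left_company      # questionnaires that could be answered
responses = 42                      # includes late respondents
unusable = 4                        # extensive missing data
usable = responses - unusable
target_population = 200             # approximate figure from Section 3.3

response_rate = responses / eligible
coverage = usable / target_population
print(eligible, usable, f"{response_rate:.1%}", f"{coverage:.0%}")
```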

A total of 18 responses were received before the response deadline. All non-respondents

after the response deadline were contacted and reminded to fill out the questionnaire. When non-

respondents and late respondents were contacted and asked why they had not yet responded,

their primary stated reason was that they were too busy. Thus, we consider that to be the main

reason for non-response⁸. The characteristics of all respondents and the IS organizations that

they assessed are summarized in Figure 2.

Late respondents are considered to provide a good measure of the characteristics of non-

respondents (Armstrong and Overton 1977). To test for non-response bias, early respondents

were compared with late respondents with respect to their demographic characteristics. Given

the relatively small sample, a decision rule had to be employed in choosing the most appropriate

test (Siegel and Castellan 1988). The demographic characteristic frequencies were tabulated in

r x c tables, where c was always 2 and r was either 2 or 3. For 2x2 tables, if all expected

frequencies were greater than 5, then a Chi-Square test was used. Otherwise, the Fisher exact

test was used. For 3x2 tables, if less than 20% of the expected frequencies were less than 5 and

if no cell had an expected frequency of less than 1, then the Chi-Square test was used.

Otherwise, cells were combined by merging rows, and the 2x2 table decision rule was applied.
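
The decision rule above can be sketched in Python as follows. This is an illustrative reading of the rule rather than the procedure as actually run: the function names and the example counts are hypothetical, and because the text does not say which rows are merged, the sketch only reports that merging is required.

```python
def expected_frequencies(table):
    """Expected cell counts under independence for an r x c table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def choose_test(table):
    """Apply the decision rule to a 2x2 or 3x2 table of frequencies."""
    if any(len(row) != 2 for row in table):
        raise ValueError("the rule is stated for tables with two columns")
    cells = [e for row in expected_frequencies(table) for e in row]
    if len(table) == 2:
        return "chi-square" if all(e > 5 for e in cells) else "fisher-exact"
    if len(table) == 3:
        share_small = sum(e < 5 for e in cells) / len(cells)
        if share_small < 0.20 and min(cells) >= 1:
            return "chi-square"
        return "merge rows, then apply the 2x2 rule"
    raise ValueError("the rule is stated only for 2x2 and 3x2 tables")

# Hypothetical frequency tables (rows: categories; columns: early, late).
print(choose_test([[12, 9], [10, 11]]))
print(choose_test([[3, 15], [2, 18]]))
print(choose_test([[2, 1], [9, 10], [8, 8]]))
```

The selected test itself would then be run with a statistics package; the sketch covers only the choice between tests.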

8 Due to resource limitations, we could not pursue the reasons for non-response to a greater level of detail.

All tests of non-response bias were two-tailed and were conducted at an alpha level of 0.1.

For the characterization by location (Canada/Outside Canada), by budget (low/high), by number

of personnel (low/high), by business sector (government/non-government), and by position of

respondents (management/technical/coaching-auditing), the null hypothesis of no difference

could not be rejected. However, for the characterization by overall experience (low/high

determined around the mean), there was a significant difference. A closer examination showed

that respondents with technical positions tend to be more experienced in the respondents group

compared with the late respondents group. Hence, on demographic characteristics there seems

to be a slight bias.

Finn et al. (1983) have argued that demographic differences between respondents and non-

respondents do not automatically signal bias. To investigate this possibility, respondents and

non-respondents were compared on their four maturity scores. The Mann-Whitney U test was

used. For all four tests, the null hypothesis of no difference between the medians could not be

rejected. Hence, there seems to be no significant bias in responses between respondents and

non-respondents, even though a slight demographic difference was identified.

4 Reliability Estimates And Applications

The results of the study and their application are demonstrated in this section. It should be noted

that our instrument was intended to provide a general measure of IS organizational maturity

rather than facilitating detailed analysis of an IS organization's strengths and weaknesses, and

should therefore be interpreted and used in a congruent manner.

4.1 Reliability Estimates

The items covering each of the four organizational maturity dimensions mentioned earlier are

shown in Figure 3. For each of the four dimensions, the Cronbach alpha coefficient has been

computed. Cronbach alpha is one reliability estimate. Details of this coefficient and its

computation are given in Appendix A. According to the guidelines given by Nunnally (1978), each

of these reliability coefficients would be interpreted to be relatively high.
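
For readers without Appendix A at hand, the following sketch computes the coefficient using the commonly cited formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the ratings are hypothetical and do not come from our data.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding scores for the same respondents."""
    k = len(item_scores)
    item_variances = [statistics.variance(item) for item in item_scores]
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum(item_variances) / statistics.variance(totals))

# Hypothetical ratings: three items, five assessed organizations.
items = [
    [2, 4, 5, 3, 6],
    [3, 4, 6, 2, 7],
    [2, 5, 5, 3, 6],
]
print(round(cronbach_alpha(items), 3))
```

Items that rank the organizations in nearly the same order, as in this made-up example, give a coefficient close to 1.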

Since each of the four maturity dimensions was considered separately, there is the resultant

implication that the scales are homogeneous within each dimension and that they are

heterogeneous across dimensions. Scores on these homogeneous scales can be combined

linearly to produce a more heterogeneous overall maturity scale. One possible caveat with such

linear combinations is the difficulty in interpreting the combined score (Allen and Yen 1979).

With the above caveat in mind, the four dimensions were combined linearly into a single

overall maturity score. Cronbach alpha reliability cannot be used to estimate reliability of the

combination since its calculation presumes homogeneous traits (Allen and Yen 1979). Therefore,

an alternative estimate described by Nunnally (1978) is used. The reliability estimate for this

composite is shown in the last row of Figure 3.
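
The exact expression is not reproduced in this section. As a hedged illustration, the sketch below uses the form that follows directly from the X = T + E model when the errors of the components are uncorrelated: the error variance of a sum is the sum of var_i * (1 - r_ii), so the reliability of the sum is 1 minus that quantity divided by the variance of the sum. Whether this matches the estimate we applied term for term should be checked against Nunnally (1978); the function name and all values are hypothetical.

```python
def composite_reliability(variances, reliabilities, composite_variance):
    """Reliability of an unweighted sum of component scores.

    Assumes uncorrelated errors, so the error variance of the sum is
    the sum of var_i * (1 - r_ii) over the components.
    """
    error_variance = sum(v * (1.0 - r) for v, r in zip(variances, reliabilities))
    return 1.0 - error_variance / composite_variance

# Hypothetical: four z-scored dimensions (variance 1 each) with an average
# inter-dimension correlation of 0.4, so var(sum) = 4 + 12 * 0.4 = 8.8.
print(round(composite_reliability([1.0] * 4, [0.85, 0.80, 0.75, 0.90], 8.8), 3))
```

Under these assumptions, positively correlated components yield a composite that is more reliable than any single component.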

In the remainder of this section we demonstrate two common applications of reliability

estimates. These two applications are: presenting and comparing assessment scores, and

investigating relationships. Both applications represent the kinds of decisions that are made

using maturity scores.

4.2 Presenting And Comparing Scores

Figure 4 shows the maturity scores of two different organizations on the four maturity

dimensions. In this figure, the raw scores have been transformed into z scores to facilitate score

comparisons⁹. The first thing to note is that the observed scores are within a band defined by

plus or minus one standard error of measurement¹⁰. This makes it clear for score interpretation

that there is an element of uncertainty associated with each observed maturity score (i.e., it is

only one of many possible scores that would be obtained had the organization been assessed

9 A z-score is a standard score. Standard scores are linear transformations of raw scores that have a predefined mean and standard deviation (see Angoff 1971). A z-score has a mean of 0 and a standard deviation of 1. This is computed by subtracting the mean from the raw score and dividing the result by the standard deviation. The unit of a z-score is the standard deviation. Converting a raw score to a z-score has the advantages that it conveys information about the relative standing of a score and makes it possible to compare scores on different maturity scales, as we do in this paper.

10 The standard error of measurement, as used in this paper, can be viewed as the standard deviation of the differences between a typical organization's true score and the observed score over a large number of repeated assessments. Also see (Allen and Yen 1979; Lord and Novick 1968; Nunnally 1978) for more details.

repeatedly). Such uncertainty is affected by the extent of unreliability of the assessment

procedure (the lower the reliability, the greater the standard error of measurement).
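
The z-score transformation and the error band can be sketched as follows. The sketch uses the classical relation SEM = sd * sqrt(1 - reliability), which reduces to sqrt(1 - reliability) for z-scores. The reliability value of 0.84 is hypothetical, and the use of the population standard deviation is an assumption.

```python
from math import sqrt
from statistics import mean, pstdev


def z_scores(raw):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(raw), pstdev(raw)
    return [(x - m) / s for x in raw]


def standard_error_of_measurement(sd, reliability):
    """Classical standard error of measurement for a scale."""
    return sd * sqrt(1 - reliability)


def score_band(z, reliability):
    """Band of plus or minus one SEM around an observed z-score."""
    sem = standard_error_of_measurement(1.0, reliability)
    return (z - sem, z + sem)


# Hypothetical reliability of 0.84 applied to an observed z-score.
low, high = score_band(-0.778, 0.84)
```

Lower reliability widens the band: a reliability of 0.84 gives an SEM of 0.4 standard deviations, whereas 0.64 gives 0.6.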

When comparing the scores on the four maturity dimensions for a single organization, the

standard error of measurement of the difference scores should be taken into account. For the

data from our case study, these standard errors of measurement are tabulated in Figure 5. The

smaller a difference score is relative to this standard error of measurement, the less significant

the difference.

For example, the z score on the standardization dimension for organization P is -0.778, and

its score on the tools dimension is -0.301. The difference score is 0.477. This is less than the

0.4908 value tabulated in Figure 5, and hence one can conclude that the difference between the

scores on the two dimensions is not significant. Therefore, if the organization were to allocate

improvement resources to increase its maturity, it is not obvious from the assessment scores

which of these two dimensions deserves the more immediate attention.

Conversely, for organization P, the difference between the standardization and project

management dimensions is relatively large (difference is 1.392) compared to the standard error

of measurement of the difference score tabulated in Figure 5 (0.4589). Therefore, it is clear that

the difference between these two scores is significant. This means that for organization P, there

is a difference between the standardization and project management dimensions that is not

simply an artifact of chance. Organization P has good reason to allocate scarce resources to

improve its score on the standardization dimension instead of on the project management

dimension.
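
The comparison rule used in these examples can be sketched as below. The differences and tabulated standard errors are the values quoted in the text. The formula sqrt(2 - r_a - r_b) for the standard error of a difference between two z-scores is the classical one and is assumed, not confirmed, to underlie Figure 5; the reliabilities of 0.88 are hypothetical.

```python
from math import sqrt


def se_of_difference(rel_a, rel_b):
    """Standard error of the difference between two z-scores.

    Assumes independent measurement errors, so the two squared
    standard errors of measurement, (1 - rel), simply add.
    """
    return sqrt(2 - rel_a - rel_b)


def exceeds_standard_error(difference, se_diff):
    """True when the absolute difference is larger than its standard error."""
    return abs(difference) > se_diff


# Values quoted in the text for organization P.
standardization_vs_tools = exceeds_standard_error(0.477, 0.4908)
standardization_vs_project_mgmt = exceeds_standard_error(1.392, 0.4589)

# Hypothetical reliabilities, for illustration of the formula only.
example_se = se_of_difference(0.88, 0.88)
```

The same function covers a comparison of two organizations on one dimension: both scores then share a single reliability r, and the standard error becomes sqrt(2 * (1 - r)).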

If we wish to compare the maturity scores obtained by two different organizations, the

standard error of measurement of the difference scores should be considered. For the data from

our case study, the standard errors of measurement for differences on the same dimension are

tabulated in Figure 6.

For example, should we wish to compare scores on the organization dimension, we find that

the difference between organizations P and Q in Figure 4 (which is 0.302) is less than the

standard error of measurement as tabulated in Figure 6 (which is 0.6074). Therefore, the

difference of the scores on this dimension for the two organizations should not be considered as

significant. This means that on the organization dimension, it is not clear from the scores which

organization is better. Conversely, the difference on the tools dimension (which is 0.615) is

significant when compared to the tabulated value of 0.4992. Therefore, one can confidently

conclude that organization Q is better than organization P on the tools dimension.

The comparison applications and the arguments for consideration of measurement errors can

be easily extended to the case of an organization using maturity scores to track its improvement

progress. In such a case the difference scores would be change scores.

4.3 Investigating Relationships

The second application that is considered here is concerned with ascertaining the benefits of

maturity. Achieving higher organizational maturity would be meaningless unless there were some

benefits to be gained. One possible way of gauging benefits is to look at the success of

processes and projects within the organization.

As part of our case study, we tested the hypothesis that organizational maturity (the four

dimensions) is positively associated with the success of the requirements engineering process.

The logic behind the hypothesis is that we expect that individual projects would be more

successful the higher the maturity of the organization. One could argue against the usefulness of

increased maturity if it is not associated with greater success of individual projects.

We focused on the requirements engineering process because it is considered to be one of

the more important processes in software development. For example, previous research has

shown that it costs 5 to 10 times more to fix errors during coding than during the requirements

engineering process (Boehm 1981), and that it costs from 100 to 200 times more during post-

deployment evolution (Boehm 1981; 1987). One study found a strong positive relationship

between software system errors and errors identified in requirements specifications (Davis 1989).

In addition, other authors found that the requirements engineering phase is the source of the

majority of detected software code errors (Basili and Perricone 1984; Endres 1975; Jones 1994;

Rubey et al. 1975).

During our case study, we collected data on the success of the requirements engineering

process for one project in each assessed organization. Each of the respondents was requested

to assess the success of the requirements engineering process for a project in which he/she has

recently been, or is currently, involved and in which the requirements engineering phase has been

entered and exited at least once.

The specific instrument that was used for assessing requirements engineering success is

described in detail in (El Emam and Madhavji 1995). This instrument assesses two dimensions of

requirements engineering success: the quality of requirements engineering service (this has two

subdimensions: (a) user satisfaction and commitment, and (b) the fit of the recommended

solution with the user organization), and quality of requirements engineering products (this has

two subdimensions: (a) the quality of the architecture, and (b) the quality of the costs/benefits

analysis). The reliability estimates for both of these dimensions are shown in Figure 7, along with

their standard deviations and means.

The Pearson product moment correlation was computed between each of the 4 dimensions of

maturity and each of the 2 dimensions of requirements engineering success. The resulting

correlation coefficients are shown in Figure 8. Given that the measurement of each of the

variable pairs was not perfectly reliable, the correlation coefficients were corrected for attenuation

(see Nunnally 1978). The outcomes of this correction are also shown in Figure 8. As can be

seen, the correction did not change the magnitude of the correlation remarkably. It nevertheless

demonstrates the concept of attenuation due to unreliability.
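
The classical correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities. A minimal sketch follows; the numerical values are hypothetical and are not those of Figure 8.

```python
from math import sqrt


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the correlation between true scores.

    r_xy: observed correlation between measures x and y.
    rel_x, rel_y: reliability coefficients of x and y.
    """
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical observed correlation and reliabilities.
corrected = correct_for_attenuation(0.5, 0.8, 0.9)
```

When both reliabilities are high, as in the case study, the divisor is close to 1 and the corrected value stays close to the observed one.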

The results of this analysis reveal that only the organization dimension of maturity is

significantly related to the quality of service. However, given that only approximately 30% of the

variation in quality of service is explained by the organization dimension, such a relationship is

not very strong. The correlations between the overall maturity score and requirements

engineering success are also shown in Figure 8. A small but significant relationship was found

between maturity and quality of service, while no relationship was found with quality of products.

Thus, we cannot present any strong evidence showing a strong positive association between

organizational maturity and the success of the requirements engineering process.

The scaling model used for the measurement of the organizational maturity dimensions is the

summative (or "Likert", as it is also referred to) model (McIver and Carmines 1981). In this model,

the scores on all the items in a dimension are summed to obtain the total dimension score.

Galletta and Lederer (1989), when discussing the User Information Satisfaction instrument

(which also uses a summative model), consider the resultant measure to be on an ordinal

scale11. Given this interpretation, it would be more appropriate to use a nonparametric correlation

coefficient in our analysis, such as the Spearman rank correlation coefficient (Siegel and

Castellan 1988).
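
To make the distinction concrete, the sketch below computes both coefficients in plain Python. The Spearman coefficient is obtained as the Pearson correlation of the ranks, with tied values given the mean of their ranks. The function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores with a monotonic but nonlinear relationship.
maturity = [1, 2, 3, 4, 5]
success = [1, 4, 9, 16, 25]
```

For these data the Spearman coefficient is 1 because the ordering is preserved exactly, while the Pearson coefficient is below 1 because the relationship is not linear.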

The Spearman rank correlation coefficients for all the hypothesized relationships are shown in

Figure 8. As can be seen, a statistically significant relationship between the organization

dimension and quality of service is indicated. Also, a weak, but statistically significant,

11 McIver and Carmines (1981), however, consider that the summative scaling model produces interval level scales. In general, and following Tukey's perspective (Tukey 1986) that if a scale is not interval, it does not necessarily have to be merely ordinal, our maturity scales are expected to occupy the grey region between ordinal and interval level measurement. We therefore present both the parametric and nonparametric measures of association. As is seen in the analysis, based on our data, our conclusions would not differ using either of the measures of association.


relationship between overall maturity and quality of service was found. No relationship between

maturity and quality of products was found. Moreover, small negative correlations between the

standardization and project management dimensions and the quality of products were found.

However, these were not significant.

We also conducted a posthoc power analysis12 of the results of the correlational analysis. The

tables in (Cohen 1977; Kraemer and Thiemann 1987) were used for the Pearson and Spearman

coefficients. A 0.05 α-level and one-tailed tests were used in this analysis. In the case of the

quality of products dimension, all tests had a power of less than approximately 15%. In the case

of the quality of service dimension, the standardization, project management and tools tests had

power of less than approximately 50%; the organization dimension tests had a power of

approximately 95%; and the overall maturity tests had a power of approximately 81% for the

Pearson correlation and approximately 60% for the Spearman correlation.

It can be seen that the tests of significance were, in general, not powerful enough to detect a

significant relationship. The generally low magnitudes of the correlation coefficients contribute to

the low power, as well as the small sample sizes. Ideally, one would strive for a power value

between 90% and 95%13.
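
The order of magnitude of the required sample sizes can be checked with the Fisher z approximation, n = ((z_alpha + z_beta) / atanh(r))^2 + 3. This is a sketch of a standard approximation, not the tabulated procedure of Cohen (1977), so its results may differ somewhat from tabulated values. The function name is hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def sample_size_for_correlation(r, alpha=0.05, power=0.90):
    """Approximate n needed to detect correlation r with a one-tailed test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)       # quantile for desired power
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


n_medium = sample_size_for_correlation(0.3)
n_small = sample_size_for_correlation(0.1)
```

Under this approximation a correlation of 0.3 requires 93 observations for 90% power, while a correlation of 0.1 requires roughly 850, which illustrates how quickly the required sample grows as the effect size shrinks.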

The results of the above correlational analysis should be qualified by a number of

explanations. First of all, no causal relationship was implied in the hypothesis nor in the

subsequent analysis. This is especially true given that the data was collected cross-sectionally

rather than longitudinally. Second, given our limited domain of analysis and sample frame, and

the lack of internal or external replication of this study, great caution should be taken in

generalizing the lack of strong relationships between maturity and requirements engineering

success. Third, it is plausible that our measures are not highly valid. Strictly speaking, ensuring

validity is an on-going process. While we have taken great care in producing valid measures (a

summary of the evidence on validity is given in appendix C), further studies are necessary to

confirm our validity claims. Fourth, contingency variables may be moderating the relationships, a

12 Statistical power is defined as the probability that a statistical test will correctly reject the null hypothesis. It is generally recommended that a power analysis be conducted especially when the results of a research study do not find significant relationships between the variables of interest (Baroudi and Orlikowski 1989).

13 It is informative to note that to attain 90% power with effect sizes so small (e.g., correlation coefficients between 0.1 and 0.3), an approximate sample size of 93 is required for r=0.3, and an approximate sample size of 864 is required for r=0.1 (from the tables in (Cohen 1977) using α=0.05 for a one-tailed test).

possibility which was not considered here. However, given that the software process community

is in the formative stage of theory development within the domain of organizational maturity

research, the assumptions made here are not inappropriate. Finally, the generally low power of

the tests of significance may be the reason for the inability to reject the null hypothesis of no

relationship.

To summarize then, we have shown that reliability estimates can be applied in correcting the

attenuation in relationships. In our specific case study, we investigated the relationship between

maturity and requirements engineering success. Only a minor relationship was found between

the organization dimension of maturity and the quality of requirements engineering service. No

relationship was found between maturity and quality of requirements engineering products.

5 Conclusions

The objectives of this paper were to review some of the basic concepts and methods concerning

the reliability of measurement and reliability estimation, and to show how these estimates can be

used in decision making. This review should prove useful to both researchers and practitioners

intending to develop instruments for assessing organizational maturity by providing them with an

initial understanding of the concepts and pointers for further exploration. Furthermore, we hope

to have created an awareness of measurement error in assessment scores and hence have

contributed towards improving the interpretation of such scores.

We have also presented an example case study that was intended to provide practitioners

and researchers with an initial reference example for their instrument development efforts. For

our case study, we developed an instrument-based assessment method with reliabilities in the

range of 0.8-0.9, which is considered to be high. This is consistent with the reliability coefficient

reported in (Humphrey and Curtis 1991) for another assessment method.

In the case study, we demonstrated how to use reliability estimates in decision making

situations. Specifically, two decision making situations were considered. First, when comparing

assessment scores, reliability should be taken into account to determine whether the difference is

significant. Second, when investigating the relationship between maturity scores and some

criterion of effectiveness, reliability estimates can be used to correct the magnitude of the

relationship for attenuation due to unreliability.

A further finding from our case study was the lack of strong relationships between our

measures of organizational maturity and our criterion measure of effectiveness, namely the

success of the requirements engineering process. Both parametric and nonparametric

correlations were found to be generally low and many were not significantly different from zero. A

posthoc power analysis indicated that for effect sizes of such a small magnitude, much larger

sample sizes would be required in future empirical studies.

Given the rise in the use of maturity assessment methods in industry, and given the

implications of decisions made based on the results of such assessments, it would be prudent to

increase the reliability of these methods and their application. Guidelines for increasing the

reliability of assessment methods and their applications are given in Appendix D of this paper. It

would also be prudent of developers of such methods to ascertain how reliable they are and to

publish the details of their reliability studies. Where no reliability estimates are provided, it would

be prudent of users of such methods to be aware of the possibility of measurement error and to

reduce their reliance on such methods or at least their reliance on assessment scores in their

own decision making.

Acknowledgements

The authors wish to thank Jean-Normand Drouin, Dennis Goldenson, and the anonymous

referees for their valuable comments on an earlier version of this paper.

References

Allen, M. and Yen, W. 1979. Introduction to Measurement Theory. Brooks/Cole Publishing Company.

Angoff, W. 1971. Scales, norms, and equivalent scores. Educational Measurement, R. Thorndike (ed.), American Council on Education.

Armstrong, J. and Overton, T. 1977. Estimating nonresponse bias in mail surveys. Journal of Marketing Research, Vol. XIV, August, pages 396-402.

Baroudi, J. and Orlikowski, W. 1989. The problem of statistical power in MIS research. MIS Quarterly, March, pages 87-106.

Basili, V. and Perricone, B. 1984. Software errors and complexity: An empirical investigation. Communications of the ACM, 27(1):42-52.

Bell Canada 1992. TRILLIUM: Telecom Software Product Development Process Capability Assessment. Technical Report, Bell Canada.

Coallier, F. 1995. TRILLIUM: A model for the assessment of telecom product development & support capability. Software Process Newsletter, IEEE Computer Society, Winter, No. 2, pages 3-8.

Cohen, J. 1977. Statistical Power Analysis for the Behavioral Sciences. Academic Press.

Besselman, J., Byrnes, P., Lin, C., Paulk, M. and Puranik, R. 1993. Software Capability Evaluations: Experiences from the field. SEI Technical Review.

Boehm, B. 1981. Software Engineering Economics, Prentice Hall.

Boehm, B. 1987. Industrial software metrics top 10 list. IEEE Software, September, pages 84-85.

Bohrnstedt, G. 1970. Reliability and validity assessment in attitude measurement. Attitude Measurement, G. Summers (ed.), Rand-McNally, pages 80-99.

Bollinger, T. and McGowan, C. 1991. A critical look at Software Capability Evaluations. IEEE Software, July, pages 25-41.

Card, D. 1992. Capability evaluations rated highly variable. IEEE Software, September, pages 105-106.

Carmines, E. and Zeller, R. 1979. Reliability and Validity Assessment, Sage Publications, Beverly Hills.

Cohen, J. and Cohen, P. 1983. Applied Multiple Regression / Correlation Analysis for the Behavioral Sciences, Lawrence Erlbaum Associates.

Cronbach, L. 1951. Coefficient alpha and the internal consistency of tests. Psychometrika, September, pages 297-334.

Daskalantonakis, M. 1994. Achieving higher SEI levels. IEEE Software, July, pages 17-24.

Davis, J. 1989. Identification of errors in software requirements through use of automated requirements tools. Information and Software Technology, November, 31(9):472-476.

Dorling, A. 1993. SPICE: Software Process Improvement and Capability dEtermination. Information and Software Technology, June/July, 35(6/7):404-406.

Drew, D. 1992. Tailoring the Software Engineering Institute's (SEI) Capability Maturity Model (CMM) to a software sustaining engineering organization. Proceedings of the International Conference on Software Maintenance, pages 137-144.

Drouin, J-N 1994a. Software quality - An international concern. Software Process, Quality & ISO 9000, August, 3(8):1-4.

Drouin, J-N 1994b. The SPICE project: An overview. Software Process Newsletter, IEEE Computer Society, Winter, No. 2, pages 8-9.

El Emam, K. and Goldenson, D. R. 1995. SPICE: An empiricist's perspective. Proceedings of the Second International Software Engineering Standards Symposium (to appear).

El Emam, K. and Madhavji, N. H. 1995. Measuring the success of requirements engineering processes. Proceedings of the Second IEEE International Symposium on Requirements Engineering, March, pages 204-211.

Endres, A. 1975. An analysis of errors and their causes in system programs. IEEE Transactions on Software Engineering, June, 1(2):140-149.

Finn, D., Wang, C-K, and Lamb, C. 1983. An examination of the effects of sample composition bias in a mail survey. Journal of the Marketing Research Society, 25(4):331-338.

Galletta, D. and Lederer, A. 1989. Some cautions on the measurement of User Information Satisfaction. Decision Sciences, 20:419-438.

Ghiselli, E., Campbell, J., and Zedeck, S. 1981. Measurement Theory for the Behavioral Sciences, W. H. Freeman.

Guilford, J. 1954. Psychometric Methods, McGraw-Hill.

Haase, V., Messnarz, R., Koch, G., Kugler, H., and Decrinis, P. 1994. Bootstrap: Fine-tuning process assessment. IEEE Software, July, pages 25-35.

Humphrey, W. 1988. Characterizing the software process: A maturity framework. IEEE Software, March, pages 73-79.

Humphrey, W. 1989. Managing the Software Process, Addison-Wesley.

Humphrey, W. and Curtis, B. 1991. Comments on 'A Critical Look'. IEEE Software, July, pages 42-46.

Japan SC7 WG10 SPICE Committee 1994. Report of Japanese trial process assessment by SPICE method. A SPICE Project Report.

Jones, C. 1994. Assessment and Control of Software Risks, Prentice-Hall.

Kerlinger, F. 1986. Foundations of Behavioral Research. Harcourt Brace Jovanovich, Orlando, FL.

Koch, G. 1993. Process assessment: The 'BOOTSTRAP' approach. Information and Software Technology, June/July, 35(6/7):387-403.

Konrad, M. 1994. On the horizon: An international standard for software process improvement. Software Process Improvement Forum, September/October, pages 6-8.

Kraemer, H. and Thiemann, S. 1987. How Many Subjects? Statistical Power Analysis in Research, Sage Publications, Beverly Hills.

Kuvaja, P., Simila, J., Kranik, L., Bicego, A., Saukkonen, S., and Koch, G. 1994. Software Process Assessment and Improvement: The Bootstrap Approach, Blackwell.

Lord, F. and Novick, M. 1968. Statistical Theories of Mental Test Scores, Addison-Wesley.

McIver, J. and Carmines, E. 1981. Unidimensional Scaling. Sage Publications, Beverly Hills.

Nolan, R. 1973. Managing the computer resource: A stage hypothesis. Communications of the ACM, July, 16(7):399-405.

Nunnally, J. 1978. Psychometric Theory, McGraw-Hill.

Osgood, C., Suci, G., and Tannenbaum, P. 1967. The Measurement of Meaning, University of Illinois Press.

Paulk, M. and Konrad, M. 1994a. Measuring Process Capability Versus Organizational Process Maturity. Proceedings of the 4th International Conference on Software Quality.

Paulk, M. and Konrad, M. 1994b. ISO seeks to harmonize numerous global efforts in software process management. Computer, April, pages 68-70.

Paulk, M., Curtis, B., Chrissis, M-B, and Weber, C. 1993a. Capability Maturity Model, version 1.1. IEEE Software, July, pages 18-27.

Paulk, M., Curtis, B., Chrissis, M-B, and Weber, C. 1993b. Capability Maturity Model, version 1.1. Technical Report CMU/SEI-93-TR-24, Software Engineering Institute.

Rubey, R., Dana, J., and Biche, P. 1975. Quantitative aspects of software validation. IEEE Transactions on Software Engineering, June, 1(2):150-155.

Rugg, D. 1993. Using a capability evaluation to select a contractor. IEEE Software, July, pages 36-45.

Saiedian, H. and Kuzara, R. 1995. SEI Capability Maturity Model's impact on contractors. Computer, January, 28(1):16-26.

Sethi, V. and King, W. 1991. Construct measurement in information systems research: An illustration in strategic systems. Decision Sciences, 22:455-472.

Siegel, S. and Castellan, J. 1988. Nonparametric Statistics for the Social Sciences, McGraw Hill.

SEI 1994a. Software Capability Evaluation Version 2.0: Method Description. Technical Report, CMU/SEI-94-TR-06, Software Engineering Institute.

SEI 1994b. Software Capability Evaluation (SCE) Version 2.0: Implementation Guide. Technical Report, CMU/SEI-94-TR-05, Software Engineering Institute.

Subramanian, A. and Nilakanta, S. 1994. Measurement: A blueprint for theory-building in MIS. Information and Management, 26:13-20.

Thompson, K., Ince, D., Madden, P., and Angelone, E. 1992. Practical quality improvement through software process maturity. Technical Report, Institute of Software Engineering, Belfast.

Thorndike, E. 1904. An Introduction to the Theory of Mental and Social Measurements, Science Press.

Traub, R. 1994. Reliability for the Social Sciences: Theory and Applications, Sage Publications, Beverly Hills.

Tukey, J. 1986. Data analysis and behavioral science or learning to bear the quantitative man's burden by shunning badmandments. The Collected Works of John W. Tukey, Vol. III, L. Jones (ed.), pages 187-389, Wadsworth & Brooks/Cole.

Whitney, R., Nawrocki, E., Hayes, W., and Siegel, J. 1994. Interim Profile: Development and trial of a method to rapidly measure software engineering maturity status. Technical Report, CMU/SEI-94-TR-4, Software Engineering Institute.

Zubrow, D., Hayes, W., Siegel, J., and Goldenson, D.R. 1994. Maturity Questionnaire. Technical Report, CMU/SEI-94-SR-7, Software Engineering Institute.

Appendix A : Estimating Reliability

This appendix describes how to conduct reliability studies. This includes some general

considerations as well as specific reliability estimation methods.

Reliability estimation methods would usually be applied during a "pretest" study whose

objective would be to estimate the reliability of the particular assessment procedures that are

prescribed. Reliability estimates (or coefficients) are calculated from assessment scores of a

sample of organizations, not the whole population of organizations. Thus, a pretest study

provides a sample estimate of the reliability for the whole population. For different samples, it is

likely that different sample estimates would be obtained.

When conducting a pretest study, at least the following three issues should be taken into

consideration:

1 Sample Representativeness

The sample of organizations that are chosen should represent a well-defined population.

The extent to which the reliability coefficients can be generalized beyond the pretest sample

itself depends on the representativeness of the sample.

2 Identical Assessment Procedures and Conditions

The procedures and conditions under which the assessment was performed during the

pretest study should be similar to those that will exist in real applications of the prescribed

assessment procedures. Otherwise, the reliability coefficients obtained during the pretest

study may not pertain to actual applications of the assessment procedures.

3 Independence of Assessments

Assessments conducted during a pretest study should be independent of each other. This

means that the assessment of one organization should not be influenced by nor have an

influence on any other organization's assessment.

There are four basic methods for estimating reliability. All of the four methods attempt to

determine the proportion of variance in a measurement scale that is systematic. In general, with

these methods one correlates a score from a particular scale with scores from some form of a

replication of the scale. If the correlation is high, most of the variance is of the systematic type.

The different methods can be classified by the number of different assessment procedures

necessary and the number of assessment occasions necessary. This classification is depicted in

Figure 9.

Test-Retest Method

This is the simplest method for estimating reliability. In our context, one would assess each

organization's maturity at two points in time using the same assessment procedure. Reliability

would be estimated by the correlation between the scores obtained on the two assessments.

This method has the advantage of requiring only one form of the assessment procedure.

The primary disadvantages of using this method are threefold. First, it is often expensive and

even impractical to conduct maturity assessments at more than one point in time. Prior

experience has identified the costs of assessments as a concern (Besselman et al. 1993; Japan

SC7 WG10 SPICE Committee 1994). Second, it is not obvious that a low reliability coefficient

obtained using this method indicates low reliability. Another possible reason for a low reliability

coefficient is that the organization's maturity has changed between the two assessments. For

example, the initial assessment results could sensitize an organization to specific weaknesses

which may prompt them to initiate an improvement effort that influences the result of subsequent

assessments. This would generally lead to an underestimate of reliability. Third, carry-over

effects between assessments may lead to an overestimate of reliability. For example, the

reliability coefficient can be artificially inflated due to memory effects like the assessees knowing

the 'right' answers that they have learned from previous assessments and assessors

remembering responses from previous assessments and, deliberately or otherwise, repeating

them in an attempt to maintain consistency of results.

Estimating reliability using the test-retest method is troublesome because it is not easy to

determine an appropriate time interval between assessments. If one increases the interval to

minimize carry-over effects, then one is also increasing the likelihood that the true organizational

maturity has changed. Due to its heavy focus on stability over time, the test-retest reliability

coefficient is also sometimes referred to as the stability coefficient.

Alternative Forms Method

Instead of using the same assessment procedure on two occasions, the alternative forms

method stipulates that two alternative assessment procedures be used. This can be achieved, for

example, by using two different maturity questionnaires or having two alternative, but equally

qualified, assessors (or assessment teams) for the two occasions.

This method can be characterized either as immediate (where the two occasions are

concurrent in time), or delayed (where the two occasions are separated in time). The correlation

coefficient is then used as an estimate of reliability of either one of the alternate forms. If

assessments are made on more than two occasions, the usual practice is to take the average of

the inter-correlations as an estimate of reliability. Interpreting the correlation coefficient in such a

manner is only applicable if the obtained scores satisfy the criteria for parallel tests14 or are linear

functions of parallel test scores.

In the delayed alternative forms method, the disadvantages are similar to those for the test-

retest method. Furthermore, for both (immediate and delayed) approaches, there is the practical

difficulty, and hence the possibility, that the alternative forms are not measuring organizational

maturity to the same degree. This would lead to an underestimate of reliability. In the case where

different assessors are used in the alternative forms, any discussions amongst them about the

assessed organization and its status immediately prior to or during the assessment would likely

contaminate the reliability estimates (i.e., the assessments would no longer be independent).

Since there is heavy reliance on the degree to which alternative forms are really equivalent,

the reliability estimate obtained using this method is referred to as the equivalent-forms

coefficient.

14 According to the concept of parallel tests, we can ascertain the extent of measurement error if we can obtain a series of measures for a particular construct (e.g., organizational maturity). A series of k measures is considered parallel if the true score is the same for all k measurements of the construct, and if the error variance of each of the k measurements is equal. The concept of parallel tests thus indicates that each of the k measures should be as good a measure of the construct as any of the other k-1 measures. It is not necessary that the same measurement operation be performed for all the k parallel measurements, but only that all the measurement operations must be measuring the same construct to the same degree.

Split-Halves Method

With the split-halves method, the items in an assessment instrument are divided

into two halves and the half-instruments are correlated to get an estimate of reliability. The

halves can be considered as approximations to alternative forms. A correction must be applied to

the correlation since that correlation gives the reliability of each half. One correction is known as

the Spearman-Brown prophecy formula: 2r/(1 + r), where r is the correlation. The Spearman-

Brown formula should only be used when the halves can be considered as parallel tests.

Otherwise, the Cronbach alpha coefficient (described later) should be used. Another formula

developed by Rulon (Traub 1994) can be used to estimate the reliability of the whole instrument

and does not require assumptions about the halves being parallel.

This method has the advantage that equivalent forms of an assessment procedure are not

necessary and there is no need for assessments on multiple occasions. A problem with this

method, however, is deciding how to divide an instrument in two parts. The procedure generally

used is to take even numbered items on an instrument as one part and the odd numbered ones

as the second part. This approach is generally preferred since, assuming the sequence of

responses obtained follows the sequence in the instrument, it controls for any systematic factors

operating during the assessment period that may cause responses early in the assessment to

differ from those given later on.
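The split-halves procedure can be sketched in Python as follows, under stated assumptions. The odd/even split and the Spearman-Brown correction 2r/(1 + r) follow the text. The Rulon estimate is written in its commonly cited form, one minus the variance of the half-score differences divided by the variance of the total scores, and should be checked against Traub (1994). All function names and the response data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def variance(values):
    """Population variance; the divisor cancels in the Rulon ratio below."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def odd_even_halves(responses):
    """Half-instrument scores per respondent, items taken in instrument order."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return odd, even


def spearman_brown(r):
    """Correct a half-instrument correlation r to full length: 2r / (1 + r)."""
    return 2 * r / (1 + r)


def split_half_reliability(responses):
    """Odd/even split-half estimate; assumes the halves are parallel."""
    odd, even = odd_even_halves(responses)
    return spearman_brown(pearson(odd, even))


def rulon_reliability(responses):
    """Rulon estimate; does not assume the halves are parallel."""
    odd, even = odd_even_halves(responses)
    diff = [a - b for a, b in zip(odd, even)]
    total = [a + b for a, b in zip(odd, even)]
    return 1 - variance(diff) / variance(total)


# Hypothetical 7-point responses: six respondents, six items.
responses = [
    [2, 3, 2, 3, 1, 2],
    [5, 5, 6, 4, 5, 6],
    [7, 6, 7, 7, 6, 6],
    [3, 4, 3, 2, 4, 3],
    [4, 4, 5, 5, 4, 4],
    [6, 7, 6, 6, 7, 5],
]
print(split_half_reliability(responses), rulon_reliability(responses))
```

When the two halves have equal variances, the two estimates coincide; they diverge as the half variances differ.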

Internal Consistency Method

With this method one examines the covariance amongst all the items in an assessment

instrument simultaneously. Two estimates of internal consistency are Guttman's L2 (Traub 1994)

and Cronbach's alpha (Cronbach 1951).

In a related scientific discipline, namely Management Information Systems (MIS),

researchers tend to report the Cronbach alpha coefficient most frequently (Subramanian and

Nilakanta 1994). Also, it is considered by some researchers to be the most important reliability

estimation approach (Sethi and King 1991). Thus, due to its frequent use and perceived

importance, the logic of computing the Cronbach alpha coefficient (from a covariance matrix) will

be described in more detail below.

The type of scale used in the most common maturity assessment procedures is a summative

one. This means that the individual scores for each question are summed up to produce an

overall score. One property of the covariance matrix for a summative scale that is important for

the following formulation is that the sum of all the elements in the matrix gives exactly the variance

of the scale as a whole.

One can think of the variability in a set of item scores as being due to one of two things: (a)

actual variation across the organizations in maturity (i.e., true variation in the construct being

measured) and this can be considered as the signal component of the variance, and (b) error

which can be considered as the noise component of the variance. Computing the Cronbach

alpha coefficient involves partitioning the total variance into signal and noise. The proportion of

total variation that is signal equals alpha.

The signal component of variance is considered to be attributable to a common source,

presumably the true score of the construct underlying the items. When maturity varies across the

different organizations, scores on all the items will vary with it because it is a cause of these

scores. The error terms are the source of unique variation that each item possesses. Whereas all

items share variability due to maturity, no two items share any variation from the same error

source (this is an assumption of the classical theory presented earlier).

Unique variation is the sum of the elements in the diagonal of the covariance matrix: $\sum_i \sigma_i^2$.

Common variation is the difference between total variation and unique variation: $\sigma_y^2 - \sum_i \sigma_i^2$, where

the first term is the variation of the whole scale. Therefore, the proportion of common variance

can be expressed as: $(\sigma_y^2 - \sum_i \sigma_i^2)/\sigma_y^2$. To express this in relative terms, the number of elements

in the matrix must be considered. The total number of elements is $k^2$, and the total number of

elements that are communal is $k^2 - k$. Thus the corrected equation for coefficient alpha

becomes: $\frac{k}{k-1}\left[\frac{\sigma_y^2 - \sum_i \sigma_i^2}{\sigma_y^2}\right]$.

To give a concrete example, we consider the covariance matrix for one dimension of maturity

described in the body of this paper, namely standardization. Using the data from our case study

application, the covariance matrix of the standardization dimension is shown in Figure 10. The

M's in the table correspond to the items in Figure 3. Using the equation given above, the value of

Cronbach alpha was computed to be: 1.25 [(53.95 - (4.32 + 3.86 + 2.62 + 2.55 + 2.46))/53.95] =

0.8837. This is the value given in Figure 3.
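The computation above can be reproduced with a short Python sketch. The function names are hypothetical. The first function works from a full covariance matrix, which is not reproduced here, and the second works from the item variances and total scale variance quoted in the text.

```python
def cronbach_alpha(cov):
    """Coefficient alpha from a k x k item covariance matrix (list of rows)."""
    k = len(cov)
    total_variance = sum(sum(row) for row in cov)       # variance of the summed scale
    unique_variance = sum(cov[i][i] for i in range(k))  # diagonal: item variances
    return (k / (k - 1)) * (total_variance - unique_variance) / total_variance


def cronbach_alpha_from_variances(item_variances, total_variance):
    """Same formula when only the diagonal and the scale variance are known."""
    k = len(item_variances)
    common_variance = total_variance - sum(item_variances)
    return (k / (k - 1)) * common_variance / total_variance


# Item variances and total variance quoted in the text for standardization.
alpha = cronbach_alpha_from_variances([4.32, 3.86, 2.62, 2.55, 2.46], 53.95)
print(round(alpha, 4))  # 0.8837
```

Here k = 5, so the correction factor k/(k - 1) is the 1.25 that appears in the worked example.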

Appendix B: Part of the Organizational Maturity Instrument

Instructions

The purpose of this instrument is to measure the overall maturity of an Information Systems department. All the questions concern the particular Information Systems department that you have been involved with recently. Pilot applications of this instrument indicate that it takes approximately 10-20 minutes to complete. Please answer all the questions.

You will find in this questionnaire some overall characteristics of the Information Systems department. These characteristics are considered important with respect to the department's maturity. Beneath each characteristic you will find a scale. You are to rate the characteristics on each of these scales in order.

Here is how you are to use those scales:

If you feel that the concept is very closely characterized by one end of the scale, you should check-mark as follows:

sufficient |__|__|__|__|__|__|X__| insufficient

sufficient |X__|__|__|__|__|__|__| insufficient

If you feel that the concept is quite closely characterized by one or the other end of the scale (but not extremely), you should place your check-mark as follows:

sufficient |__|__|__|__|__|X__|__| insufficient

sufficient |__|X__|__|__|__|__|__| insufficient

If the concept seems only slightly characterized by one side of the scale as opposed to the other side (but is not really neutral), then you should check-mark as follows:

sufficient |__|__|__|__|X__|__|__| insufficient

sufficient |__|__|X__|__|__|__|__| insufficient

If you consider the concept to be characterized as neutral on the scale, or both sides of the scale equally characterize the concept, then you should place your check-mark in the middle space:

sufficient |__|__|__|X__|__|__|__| insufficient

Thank you for your time spent completing this questionnaire.

1. The delivery approach and methodology:

none or varies with individual |__|__|__|__|__|__|__| defined and used by all

2. The modeling standards:

no standards |__|__|__|__|__|__|__| strict standards based on architecture

3. The systems documentation:

none or very inconsistent |__|__|__|__|__|__|__| available, current and clear

4. User requirements standards:

none |__|__|__|__|__|__|__| well established and implemented

5. Development standards:

none |__|__|__|__|__|__|__| well established and implemented

6. Systems cost/benefits analysis are produced and monitored:

frequently not done |__|__|__|__|__|__|__| formal justification and monitoring

7. The delivery of systems:

consistently late |__|__|__|__|__|__|__| consistently on time

8. A project manager is clearly identified for every project:

seldom |__|__|__|__|__|__|__| always

9. The project plan is produced, updated and communicated:

seldom |__|__|__|__|__|__|__| frequent

10. Project reviews:

none or undefined |__|__|__|__|__|__|__| defined and implemented

11. Systems implementation and deployment:

improvised |__|__|__|__|__|__|__| smooth and well planned

12. The use of project management tools:

none |__|__|__|__|__|__|__| pervasive

13. The integration of methodology and techniques with tools:

no integration |__|__|__|__|__|__|__| effectively reinforce each other

14. The availability of tools:

none |__|__|__|__|__|__|__| yes to everybody

15. Support for tools:

inadequate support |__|__|__|__|__|__|__| adequate support

16. The overall organization's strategy, missions, goals, tactics and priorities:

undefined |__|__|__|__|__|__|__| clearly documented and understood

17. The IS organization's strategy, goals, and priorities:

none |__|__|__|__|__|__|__| clearly documented and understood

18. The alignment between business strategy and system projects:

weak |__|__|__|__|__|__|__| strong

19. The IS organization's ability to absorb and implement innovations:

inferior |__|__|__|__|__|__|__| superior

20. The budgeting process of the IS organization:

integrated with organizational priorities |__|__|__|__|__|__|__| not integrated with organizational priorities

Appendix C: Evidence of Validity

In this appendix we present some further evidence as to the validity of the maturity scales that

we have developed. Three sets of analyses were performed following the recommendations of

Bohrnstedt (1970), Kerlinger (1986), and Nunnally (1978).

Average Interitem Correlations

The items within a dimension should correlate higher with each other than they do with items in

other dimensions (Bohrnstedt 1970). Therefore, we computed the average interitem correlations

within each dimension and compared them with the average correlation of these same items with

items in the other three dimensions. In all comparisons performed, it was found that the within-

dimension average correlation was higher than the average correlation with the other dimensions.

Item-total Correlations

The items within a dimension are expected to correlate highly with the total score for that

dimension. Such correlations provide further evidence of validity (Kerlinger 1986; Nunnally 1978).

In calculating such item-total correlations, each item score was subtracted from the total to avoid

a spurious part-whole correlation (Cohen and Cohen 1983), and the correlation of each item with

the new total score was computed. These results are shown in Figure 11. As can be seen, all

correlations are relatively high, and all are significant (one-tailed test) at α = 0.005.
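The corrected item-total correlation described above can be sketched in Python as follows. The function names and the response data are hypothetical. The sketch covers only the correlation step, with each item removed from the total before correlating, and not the significance test.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(responses):
    """Correlate each item with the dimension total that excludes that item.

    `responses` holds one list of item scores per respondent, all items
    belonging to the same dimension.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]  # total with item j removed
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical 7-point responses: six respondents, four items of one dimension.
responses = [
    [2, 3, 2, 3],
    [5, 5, 6, 4],
    [7, 6, 7, 7],
    [3, 4, 3, 2],
    [4, 4, 5, 5],
    [6, 7, 6, 6],
]
for number, r in enumerate(corrected_item_total(responses), start=1):
    print(f"item {number}: {r:.4f}")
```

Leaving the item in the total would inflate each correlation, because the item would then be correlated partly with itself.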

Factor Analysis

Factor analysis is considered to be (Kerlinger 1986) "a powerful and indispensable method of

construct validation." The results of this analysis for the four dimensions are shown in Figure 11.

As can be seen, the items that are expected to tap one dimension load highly on one factor, and

have low loadings on the other factors (all missing coefficients are less than 0.5).

***** Insert Figure 11 around here *****

The results of the above three sets of analyses, as well as the method used to develop the

maturity instrument, increase our confidence as to the validity of the maturity scales. However,

demonstrating validity is an on-going process, and it is through the accumulation of evidence

from multiple studies that we can start to make strong claims of validity.

Appendix D: Guidelines for Increasing Reliability

Below, we present some development and use guidelines for maturity assessments. These

guidelines identify issues that should be taken into consideration by those who develop

assessment procedures, and by those who use them. These guidelines describe ways of

increasing the reliability of assessment scores:

1 Standardize Assessment Procedures

The procedures used for an assessment must be standardized and individual assessments

must follow them closely to ensure consistency. In the case of assessment

instruments, instructions concerning the purpose and how to determine responses and judge

scores should be provided. In the case of interviews, the conduct of the interviews (e.g.,

assurance of confidentiality and the type and scope of evidence inspected) should be

defined.

2 Training of Assessors

Assessors should be trained in the assessment procedure and should have extensive

experience with software development and maintenance. Furthermore, there should be a

consistency in the qualifications of the assessors following a particular assessment

procedure.

3 Increasing the Length of the Assessment Instrument

Reliability estimates utilize the assessment scores. The more questions asked about the

maturity of an organization, the more likely it is that the reliability estimate will increase. Of

course, if the added questions have nothing to do with maturity, then increasing the length of

the instrument may reduce reliability. However, it is assumed that added questions are

chosen as carefully as the original questions and that they will not reduce the average inter-

item correlations.

4 Sampling of Projects

In assessment procedures where a sample of an organization's projects is assessed, and

this is used as an indicator of overall organizational maturity, specific sampling criteria

should be specified. These sampling criteria should be applied consistently in all

assessments claiming to follow a particular assessment procedure.

5 Using Multiple Point Scales

Determining the number of points on a scale involves a tradeoff between losing some of the

discriminative powers which the assessors are capable of (with too few points) and having a

scale that is too fine and hence beyond the assessors' powers of discrimination (with too

many points). In general, it has been found that there is an increase in reliability as the

number of points increases from 2 to 7, after which it tends to level off (Nunnally 1978;

Guilford 1954).

6 Having a Validation Cycle

Such a cycle involves validating the information that the assessors have initially gathered.

This may involve corroborative interviews and (further) inspections of documents. This would

seem to be more important where the assessors are external to the assessed organization

and when there is a danger of misrepresentation by the assessees.

7 Using the Same Assessors

Where no estimates of reliability are available or no reliability studies have been performed

for a particular procedure, it is safer to have the same assessors perform assessments on

different occasions and/or for different organizations. For example, in progress self-

assessments where an organization wants to determine whether maturity has increased due

to some improvement efforts, it is preferable that the same assessors be used. Also, if one

were to rank n organizations based on the assessment results, reliability would be increased

if the same assessors conducted all the assessments.

Following the above guidelines would be considered as good practice for increasing the

reliability of assessment procedures. Of course, not all the guidelines are applicable for all

assessment procedures. The context of using the assessment procedure should be taken into

consideration.
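The effect of instrument length discussed in guideline 3 can be illustrated with the general form of the Spearman-Brown prophecy formula, n r / (1 + (n - 1) r), of which the 2r/(1 + r) correction in Appendix A is the case n = 2. This general form is not given in the guidelines themselves. The Python sketch below assumes, as guideline 3 does, that the added questions are as good as the original ones. The function names are hypothetical.

```python
def prophesied_reliability(r, n):
    """Reliability expected when an instrument is lengthened n-fold.

    `r` is the current reliability and `n` the ratio of new to old length;
    the added items are assumed to be parallel to the existing ones.
    """
    return n * r / (1 + (n - 1) * r)


def required_lengthening(r, target):
    """Length ratio needed to move from reliability r to the target value."""
    return target * (1 - r) / (r * (1 - target))


# Standardization dimension (Figure 3): alpha = 0.8837 with five items.
# Doubling the number of comparable items (n = 2):
doubled = prophesied_reliability(0.8837, 2)
print(round(doubled, 4))
```

Because the gain shrinks as n grows, lengthening an already reliable scale yields diminishing returns.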

Figure 1: Path diagram depicting the organizational maturity construct and example items for its assessment.

Characteristic                          Value                                       Percentage/Average

Location of Organizations               Canada                                      58%
                                        U.S.A.                                      31.6%
                                        Australia                                   10.5%

Position of Respondents                 Management                                  39.5%
                                        Technical                                   31.6%
                                        Coaching/Auditing                           28.9%

Years of Experience of Respondents      Management                                  11.47 Yrs.
                                        Technical                                   11.17 Yrs.
                                        Coaching/Auditing                           14.18 Yrs.

Main Business of Organization           Government/Public Admin./Military           36.8%
                                        Retail, Distribution and Transportation     18.4%
                                        Aerospace                                   15.8%
                                        Financial/Insurance/Real Estate             7.9%
                                        Manufacturing (other than aerospace)        7.9%
                                        Other                                       13.1%

Budget or Total Revenue of Organization <= CA$99m                                   10.5%
                                        CA$100m - CA$149m                           5.3%
                                        CA$150m - CA$199m                           2.6%
                                        CA$200m - CA$249m                           7.9%
                                        CA$250m - CA$999m                           18.4%
                                        >= CA$1 billion                             55%

Figure 2: Characteristics of respondents and assessed organizations.

Figure 3: Characteristics of four dimensions of organizational maturity and overall maturity.

Dimension Name: Standardization
Cronbach Alpha / Composite: 0.8837    Standard Deviation: 7.3444    Mean: 17.7105
Items:
• The extent to which the delivery approach and methodology are defined and used by all in the IS department (M1)
• The extent to which strict modeling standards based on the architecture are defined (M2)
• The extent to which systems documentation is available, current and clear (M3)
• The extent to which user requirements standards are established and implemented (M4)
• The extent to which development standards are established and implemented (M5)

Dimension Name: Project Management
Cronbach Alpha / Composite: 0.9056    Standard Deviation: 9.4299    Mean: 21.8592
Items:
• The extent to which systems costs/benefits analysis are formally produced and monitored (M6)
• The extent to which the delivery of systems is consistently on time (M7)
• The extent to which a project manager is clearly identified for every project (M8)
• The extent to which a project plan is produced, updated and communicated on every project (M9)
• The extent to which project reviews were defined and implemented (M10)
• The extent to which system implementation and deployment is smooth and well planned (M11)

Dimension Name: Tools
Cronbach Alpha / Composite: 0.8755    Standard Deviation: 6.5075    Mean: 13.9559
Items:
• The extent of use of project management tools (M12)
• The extent to which methodology and techniques are integrated with tools (M13)
• The extent to which tools are available to everyone in the IS department (M14)
• The adequacy of the support provided for tools (M15)

Dimension Name: Organization
Cronbach Alpha / Composite: 0.8155    Standard Deviation: 6.6178    Mean: 21.9941
Items:
• The extent to which the overall organization's strategy, missions, goals, tactics and priorities are clearly documented and understood (M16)
• The extent to which the IS organization's strategy, goals, and priorities are clearly documented and understood (M17)
• The strength of the alignment between the business strategy and systems projects (M18)
• The IS organization's ability to absorb and implement innovations (M19)
• The extent to which the budgeting process of the IS organization is integrated with organizational priorities (M20)

Dimension Name: Overall Maturity
Cronbach Alpha / Composite: 0.9486    Standard Deviation: 23.3464    Mean: 75.5197
Items:
• Standardization
• Project Management
• Tools
• Organization

[Plot "Maturity Profile": scores of Org. P and Org. Q on the four dimensions (Standardization, Project Management, Tools, Organization); vertical axis from -2.5 to 2.5.]

Figure 4: The profiles of two organizations on the four dimensions of maturity.

Figure 5: The standard error of measurement for difference scores between dimensions.

Figure 6: The standard error of measurement for difference scores within dimensions.

Figure 7: Reliability, standard deviations, and means of the two requirements engineering success dimensions.

Figure 8: Correlations (Pearson and Spearman) and corrected correlations between maturity and requirements engineering success.

Figure 9: A classification of reliability estimation methods.

Figure 10: Covariance matrix of the first dimension of maturity: standardization.

Dimension             Item    Factor Loading    Item-total Correlation

Standardization       M1      0.6567            0.6510
                      M2      0.8585            0.8372
                      M3      0.6503            0.6475
                      M4      0.7788            0.6717
                      M5      0.7863            0.8437

Project Management    M6      0.7344            0.5927
                      M7      0.7192            0.6744
                      M8      0.6722            0.7404
                      M9      0.6352            0.7295
                      M10     0.8758            0.8910
                      M11     0.8272            0.8477

Tools                 M12     0.7311            0.6023
                      M13     0.7355            0.7984
                      M14     0.9022            0.8585
                      M15     0.8151            0.6943

Organization          M16     0.8315            0.7873
                      M17     0.6986            0.7218
                      M18     0.5694            0.5139
                      M19     0.7216            0.6137
                      M20     0.6956            0.4138

(The original table has separate columns for Factor 1 to Factor 4; which column holds each dimension's loadings is not recoverable from this layout.)

Figure 11: Factor analysis results and item-total correlations for the four dimensions of maturity.
