THE RELIABILITY OF MEASURING
ORGANIZATIONAL MATURITY
KHALED EL EMAM
NAZIM H. MADHAVJI
To appear in Software Process Improvement and Practice, John Wiley & Sons, 1995.
KHALED EL EMAM†
NAZIM H. MADHAVJI*†
†CENTRE DE RECHERCHE INFORMATIQUE DE MONTREAL (CRIM)
*SCHOOL OF COMPUTER SCIENCE, MCGILL UNIVERSITY
Abstract
One of the recently developed classes of decision making tools for software engineering
management is the organizational maturity assessment. The scores from such assessments are being
applied in focusing and tracking self-improvement efforts, and as part of the contract award decision
making process. However, until now, the important issue of the reliability of assessments has rarely
been addressed by the developers and users of assessment methods. The extent of reliability
describes the degree to which assessment scores are consistent and repeatable. In this paper we
present some basic techniques for the estimation of the reliability of maturity assessments. We then
demonstrate through a case study some of these techniques and how to apply the reliability
estimate(s) in decision making situations. Examples of decisions are: comparing organizations'
maturity scores and evaluating the relationship between maturity scores and some criterion of
effectiveness. For the latter, we found a weak relationship between maturity and our measure of
effectiveness, namely the success of the requirements engineering process.
Keywords: software process assessment, organizational maturity, reliability, empirical study.
✧ This work was supported, in part, by the IT Macroscope Project and NSERC Canada.
1 Introduction
1.1 Use of Maturity Assessments and Assessment Scores
In recent years there has been a marked increase in the number of methods for assessing the
software process maturity and software process capability of organizations1 developing software
and/or developing software-based systems. The basic premise behind these methods is that
assessment scores are positively associated with project and organizational effectiveness (e.g.,
productivity, software quality, user satisfaction, etc.).
Most prominent amongst the assessment methods is the Software Capability
Evaluation (SCE) (SEI 1994a; 1994b) based on the Capability Maturity Model (CMM) for
software developed by the SEI (Paulk et al. 1993a; 1993b). Other methods and models exist2, for
example, Trillium (Bell Canada 1992; Coallier 1995), Bootstrap (Koch 1993; Kuvaja et al. 1994),
the current effort on the SPICE project (Dorling 1993; Drouin 1994a; 1994b), and variants of the
CMM based methods (Drew 1992; Thompson et al. 1992). Furthermore, some assessment
methods are becoming de facto standards, such as the CMM based methods developed at the
SEI, while others are intended to be international standards, such as those based on the SPICE
framework (Konrad 1994; Paulk and Konrad 1994b).
The contexts within which assessment methods have been applied include: (a) self-
assessments, and (b) maturity determination. Self-assessments are voluntary and their intended
purpose is for an organization to improve its own maturity. For example, in Motorola, maturity
self-assessments are employed to motivate and evaluate process improvement
progress (Daskalantonakis 1994). Maturity determination is commonly performed by one
organization to evaluate the maturity of its suppliers on an on-going or contract-award basis. For
example, the results of such assessments reportedly have significant weights in some contract
award decisions made by the U.S. Navy (Rugg 1993), and minimum maturity scores are
expected to be a requirement for some U.S. Air Force contracts (Saiedian and Kuzara 1995).
1 As pointed out in (Paulk and Konrad 1994a), there is a difference between a maturity assessment and a process capability assessment. The former produces an overall software process maturity score(s) for an organization, while the latter produces score(s) for the implementation and institutionalization of specific process(es) in an organization, i.e., process measure(s) rather than organization measure(s). In this paper we focus more on organizational maturity. However, this does not result in any loss of generality in the presentation.
2 Many of these models are also partially based on the CMM for software developed by the SEI.
Important decisions are made by organizations based on assessment scores. For example, a
self-assessment that identifies strengths and weaknesses in an organization might lead the
organization to focus future improvement efforts and resources on rectifying the weaknesses.
The consequences of an erroneous interpretation of scores in a self-assessment situation could
be mis-allocation or inefficient allocation of resources. Also, a supplier who has been determined
to have a relatively low maturity score is less likely to be awarded a contract. The consequences
of an erroneous interpretation of scores in a contract-award situation could be the losers
contesting awards and starting costly litigation.
Some authors have discouraged an over-reliance on obtained assessment scores
(Besselman et al. 1993; Humphrey and Curtis 1991). In practice, however, characterizing
organizations quantitatively using such a score gives a sense of objectivity to decision making
and legitimizes actions taken. This indeed makes maturity scores an attractive proposition. For
example, at Motorola, Daskalantonakis (1994) states "Low scores identify key activities and key
process areas that need immediate attention to raise the organization's software process
capability". Haas et al. (1994) describe how maturity scores can be used to identify
organizational weaknesses with respect to ISO 9001 certification. In a contract award case,
Rugg (1993) states that "software capability - as measured in the submitted proposal and on site
- counted as one third of the weight for consideration to award the contract". Also, a letter from
the U.S. Department of the Air Force (dated September 1991) stated (Saiedian and Kuzara
1995) “We wish to point out that at some point in the near future, all potential software
developers will be required to demonstrate a software maturity Level 3 [on the CMM] before they
can compete in ESD/RL [...] major software development initiatives”.
1.2 Reliability of Assessments
Given the importance of the decisions made by organizations based on assessment scores, two
questions that need to be asked are: “how reliable are such assessments?” and “what are the
implications of reliability for interpreting assessment scores?”. Reliability is defined as the extent
to which the same measurement procedure will yield the same results on repeated
trials (Carmines and Zeller 1979).
Recent literature and practice have reflected a concern with the reliability of assessments. For
example, Card discusses the reliability of SCEs in a recent article (Card 1992), where he
commented on the inconsistencies of the results obtained from assessments of the same
organization by different teams. Mention is also made of reliability in a contract award situation
where emphasis is placed on having one team assess different contractors to ensure
consistency (Rugg 1993). Bollinger and McGowan (1991) criticize the extent to which the scoring
scheme used in SCEs contributes towards reduced reliability. The Interim Profile method of the
SEI (Whitney et al. 1994) includes specific indicators to evaluate reliability. Furthermore, a deep
concern with reliability is reflected in the empirical trials of the prospective SPICE standard
whereby the evaluation of the reliability of SPICE-conformant assessments is an important focus
of study (El Emam and Goldenson 1995).
Across all the maturity assessment methods in the published literature, there is only one mention
of an actual reliability study, that of Humphrey and Curtis (1991). In that article they briefly
describe a study of level 2 and 3 questions on the preliminary version of the SEI maturity
questionnaire where the reliability (as estimated by an internal consistency method which will be
described later in this article) was found to be very high (0.9). They, however, omit the details of
the study. Furthermore, use of that reliability estimate in decision making is not standard
procedure and, to our knowledge, is rarely done in practice.
The extent of unreliability has at least two important implications. First, the score obtained
from an assessment is only one of the many possible scores that would be obtained had the
organization been repeatedly assessed. This means that, for a given level of confidence that one
is willing to tolerate, an assessment score has a specific probability of falling within a range of
scores. The size of this range increases as reliability decreases. Second, when testing the
hypothesis that maturity is positively associated with the performance of organizations,
unreliability tends to attenuate the magnitude of the relationship. This means that the
relationship, which could be described using a correlation coefficient, is artificially reduced due to
unreliability. Both of the above implications ought to be seriously considered when presenting
and comparing assessment scores, and when empirically investigating relationships between
assessment scores and some criterion.
The reliability of maturity assessment methods can be estimated. Estimates of reliability would
allow one to determine the maturity score range for a given confidence level, and would also
allow one to make corrections for attenuation in correlation coefficients (Nunnally 1978). It is
therefore critical that estimates of the reliability of maturity assessment procedures be
determined and applied in decision making.
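The correction for attenuation mentioned above can be illustrated with a short Python sketch. It assumes the standard classical test theory formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. The function name and the numeric inputs are hypothetical and are not taken from the case study.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy=1.0):
    """Estimate the correlation between true scores from an observed correlation.

    r_xy: observed correlation between maturity score and criterion
    r_xx: reliability of the maturity score
    r_yy: reliability of the criterion (1.0 treats the criterion as error-free)
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical inputs, for illustration only.
observed = 0.30
corrected = correct_for_attenuation(observed, r_xx=0.80, r_yy=0.90)
print(round(corrected, 3))
```

Because both reliabilities are at most 1, the corrected coefficient is never smaller in magnitude than the observed one, which is the sense in which unreliability "artificially reduces" the relationship.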
The objectives of this paper are to show how to estimate the reliability of maturity
assessments, and how to apply the estimate in decision making (e.g., comparing maturity scores
from two organizations and determining the relationship between maturity scores and criterion
measures of effectiveness). We demonstrate reliability estimation and its application through a
case study.
Section 2 presents a review of some basic reliability concepts. The research method for the
case study application is presented in section 3. In section 4, we demonstrate how to estimate
reliability and how to apply the estimate in decision making. Section 5 concludes the paper with a
discussion of the case study application results, and their implications for future research.
2 The Reliability of Measurement
In this section we introduce the background and some concepts regarding the reliability of
measurement. The intention here is only to place the remainder of the paper in context, and
therefore the discussion is admittedly brief. Further details may be found in the texts of Nunnally
(1978) and Lord and Novick (1968).
2.1 Overview
Much theoretical and analytical work related to measurement and the reliability of measurement
has been done in the fields of psychometrics and educational testing and measurement. This
work is subsumed under the heading of test theory. From a historical perspective, what is
considered to be the first full work on test theory is the text of Thorndike (1904). Since then, a
large body of work has expanded, refined, and added to his original theory. Part of this body of
knowledge is known collectively as classical test theory.
Reliability is considered by many psychometricians to be the fundamental problem in
measurement (Ghiselli et al. 1981). The reliability of measurement is concerned with random
error or nonsystematic influences on measurement operations. When we talk about the reliability
of measurement, we refer to the precision with which a particular attribute of a concept, object or
phenomenon is being measured. Defining and estimating reliability is important because it is
common that repeated measurement of the same attribute of an object, concept, or phenomenon
will not yield exactly the same quantitative values. If the magnitude of the underlying attribute
does not change across repeated measurements, then the fluctuations in the measured values
are considered to be measurement error.
For example, let's say we wish to measure the length of an object using a ruler, and this
measurement operation is repeated over and over again. If the markings on the ruler are
sufficiently close together so that the measurements are obtained to the nearest hundredth of an
inch, then the repeated measurement operations will likely yield several different lengths. This
kind of measurement is not perfectly reliable because of the differences in the lengths obtained
(assuming that the length of the measured object did not change).
2.2 Reliability vs. Validity
The reliability of measurement is different from the validity of measurement. Reliability is
concerned with the extent to which measurement is repeatable and consistent. Validity is
concerned with the extent to which a measurement operation is measuring what it purports to
measure.
One could, for example, seek to measure intelligence by having children throw stones as far
as they could. The distance the stones were thrown on one occasion might correlate highly with
how far they are thrown on another occasion. Thus, being repeatable, the measure would be
highly reliable. However, the distance that stones are thrown would not be considered by many
observers to be a valid measure of intelligence.
The amount of measurement error places a limit on the amount of validity that a measurement
operation can have. But, even in the complete absence of measurement error, there is no
guarantee of validity. Reliability is a necessary but insufficient condition for validity.
2.3 Basic Concepts
A basic concept for comprehending the reliability of measurement is that of a construct. A
construct refers to a meaningful conceptual object. A construct is neither directly measurable nor
observable. However, the quantity or value of a construct is presumed to cause a set of
observations to take on a certain value. An observation can be considered as a question in a
maturity questionnaire (this is also referred to as an item). Thus, the construct can be indirectly
measured by considering the values of those items.
For example, organizational maturity is a construct. Thus, the value of an item measuring “the
extent to which projects follow a written organizational policy for managing system requirements
allocated to software” is presumed to be caused by the true value of organizational maturity.
Also, the value of an item measuring “the extent to which projects follow a written organizational
policy for planning software projects” is presumed to be caused by the true value of
organizational maturity3. Such a relationship is depicted in the path diagram in Figure 1. Since
organizational maturity is not directly measurable, the above two items are intended to estimate
the actual magnitude or true score of organizational maturity.
Since reliability is concerned with random measurement error, error must be considered in
any theory of reliability. The classic theory states that an observed score consists of two
components, a true score and an error score: X = T + E. Thus, X is the score obtained in a
maturity assessment, T is the mean of the theoretical distribution of X scores that would be found
in repeated assessments of the same organization using the same maturity assessment
procedure, and E is the error component.
The true score is considered to be a perfect measure of maturity. In practice, however, the
true score can never be really known since it is generally not possible to obtain a large number of
repeated assessments of the same organization4,5. True scores are therefore only hypothetical
quantities, but useful nevertheless.
3 The two items used in the example are based on the SEI's Maturity Questionnaire (Zubrow et al. 1994).
***** Insert Figure 1 around here *****
Measurement error is the difference between the observed score and the true score. It is a
property of the measurement operation and not of the organization's maturity. Considering that
the errors are random, then the observed scores obtained from repeated measurements are
sometimes higher and sometimes lower than the true score. Therefore, the error scores are
positive as frequently as they are negative. This means that, in the long run, the mean error is
zero.
The reliability of measurement is defined as the ratio of true score variance to observed score
variance. A reliability coefficient has a value between 0 (perfect unreliability) and 1 (perfect
reliability). Thus, as the error variance increases, reliability approaches 0; and as the error
variance approaches 0, reliability approaches 1.
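The definition above can be made concrete with a small simulation. The sketch below is only illustrative: it assumes normally distributed true scores and errors that are independent of each other, and all names and parameter values are hypothetical. It generates X = T + E for many simulated organizations and compares the ratio of true score variance to observed score variance with the value implied by the chosen variances.

```python
import random
import statistics

def simulate_reliability(true_sd, error_sd, n_orgs=5000, seed=1):
    """Simulate X = T + E and return var(T) / var(X)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n_orgs)]
    # Each observed score is the true score plus an independent random error.
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)

def theoretical_reliability(true_sd, error_sd):
    """Reliability implied by the model when T and E are uncorrelated."""
    return true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)

for error_sd in (0.1, 0.5, 1.0):
    print(error_sd,
          round(simulate_reliability(1.0, error_sd), 2),
          round(theoretical_reliability(1.0, error_sd), 2))
```

As the error standard deviation grows, the simulated ratio falls towards 0, and as it shrinks the ratio approaches 1, mirroring the behaviour described in the text.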
There are a number of methods for estimating the reliability of measurement procedures that
are based on the above measurement model. In Appendix A of this article, we describe these
methods, as well as provide directions on how to conduct reliability studies in general.
3 Research Method
To demonstrate reliability estimation and the application of such estimates, the conduct and
results of an example case study application are presented. In this section, we describe the
research method of the case study application. This description includes how we developed the
maturity assessment instrument, the definition of the target population, the sampling procedure
that was followed, and an evaluation of biases due to non-response (since we are not gathering
data from the whole population, we have to ensure that our sample is representative).
4 The true score as defined here is not a Platonic true score in the sense that it represents the "true" maturity of an organization.
5 If one is willing to make some assumptions (e.g., an assumption of linearity), point estimates of true scores can be computed from obtained scores (Lord and Novick 1968).
3.1 Study Background And Context
The case study application described here was performed within the context of an Information
Systems (IS) consultancy firm based in North America with clients worldwide (henceforth referred
to as Company X). The purpose of the study was to construct a reliable instrument for providing
a general measure of IS organizational maturity, and to use the instrument for assessing the
maturity of this firm's clients. This instrument would also be used as part of the software process
diagnosis and improvement services that Company X provides. The instrument could be used by
senior consultants who are knowledgeable about the particular clients. This knowledge would be
gained mainly through their participation in clients' projects. The main constraint on this
instrument development effort was that the instrument had to be short. The context of its use and
the fact that senior consultants were to provide the responses dictated the above constraint.
The domain of analysis for this study was business information systems that are fully
customized for individual user organizations. The unit of analysis was the IS department or IS
function in an overall organization. An IS department develops, maintains, and/or acquires
business information systems. In the remainder of this text the IS department will be referred to
as an organization.
3.2 Instrument Development
The first activity of instrument development was to review the existing literature on IS
organizational maturity. The main sources of information for the instrument developed here were
the work of the SEI on the CMM (Humphrey 1988; 1989), other contemporary maturity models
such as TRILLIUM (Bell Canada 1992), and the much earlier work of Nolan on defining a
maturity model for IS organizations (Nolan 1973).
Based on this review, an initial set of criteria for assessing organizational maturity was
formulated. Subsequently, 30 senior consultants were interviewed to solicit their comments on
the correctness and completeness of these criteria as general measures of organizational
maturity. The characteristics of these consultants were as follows: 70% had project management
backgrounds, 45% had technical backgrounds, and 33% had research and education
backgrounds6. Also, 91% of the interviewees were located in Canada, and 9% were located in
the USA. This distribution was dictated strongly by resource constraints.
A set of documents was also inspected. These documents were produced by the particular
consultants interviewed to aid them in their practice. The documents constituted auditing and
assessment questionnaires and frameworks. Some of these were 'homegrown', while others were
based on some of the published assessment methods (e.g., the SEI's SCE method). The
document inspections gave us indications, beyond what is in the literature, on the format and
content of maturity assessment instruments used in IS organizations, wording of questions, and
practices that are considered important.
As a consequence of the interviews and document inspections, the initial set of criteria was
refined into an initial organizational maturity instrument. A semantic differential scale (see
Osgood et al. 1967) was used for all the items in this instrument. A small pilot study was then
conducted to identify ambiguities, inconsistencies, and bad wording, and to generally get feedback on
the usability and clarity of the instrument. For this, two senior consultants from Company X were
requested to score an organization in an interview setting with one of the authors of this paper
present. Each interviewee was requested to talk out loud while scoring, indicating what he
interpreted each question to mean and the rationale for his scoring. Also, two other researchers
highly familiar with process assessments from the Software Engineering Laboratory at McGill
University reviewed the instrument and noted problems with it. Based on this pilot study, the
initial organizational maturity instrument was refined again. An abridged copy of the final version
of the instrument that is relevant for this paper is included in Appendix B.
For the purposes of this paper, we will focus on four specific dimensions of organizational
maturity that are measured by this instrument. The four dimensions are as follows: (a)
standardization, which is concerned with process and product standardization in the
organization, (b) project management, which is concerned with the extent to which good project
management practices are employed, (c) tools, which is concerned with effective automated tool
usage in the organization, and (d) organization, which is concerned mainly with the alignment of
the IS organization with the overall business.
6 It should be noted that an interviewee may be characterized as having an intersection of backgrounds, and hence the total does not add up to 100%. For example, some interviewees were initially lead architects (technical background) and were subsequently involved in project management activities, or some interviewees are conducting research or acting as course instructors on primarily technical issues (for instance data modeling techniques).
3.3 Sampling Procedure
For this instrument, the target population is the 200 organizations worldwide7 that have licensed
the systems development methodology developed and marketed by Company X. Organizations
cannot assess themselves; therefore, a set of individuals who could perform the instrument
ratings had to be identified. The first question that needed to be answered was whether employees of the
organizations or employees of Company X should score the instrument. The latter case would
constitute an external assessor. To answer this question, a small pilot study was conducted. For
this pilot study, each of three client organizations was visited and either one or two of its
employees were interviewed. Amongst the questions asked were those related to the
organization's maturity. These interviews indicated that the client personnel were likely to give
the "right" answers, and hence seemed to provide biased judgements. Therefore, only Company
X employees were considered for administering the maturity questionnaire.
The sample frame for our study was the client organizations that had good consulting
relationships with Company X (and hence there existed a consultant with good knowledge of the
organization's practices). In an attempt to construct a stratified sample, the only population
characteristics that were available to us were Gross Revenue (of Company X) by region and by
industrial sector. However, such information was deemed to be highly misleading since a sizable
percentage of Gross Revenue is obtained from a small number of client organizations, and
hence would not appropriately characterize the population.
An initial list of senior consultants that work for Company X was thus formulated. It was
considered by senior management and researchers of Company X that the consultants whose
names are on the list were involved with a highly representative cross-section of their clientele
(they covered all business regions in Canada and outside Canada and all organization sizes that
Company X did business with). Systematically, consultants were selected from the list and
contacted (by telephone, face-to-face, or by electronic mail) and requested to participate in the
study. Some consultants refused to participate. The non-refusals constituted our sample. All
consultants in the sample were highly familiar with the assessed organizations through their
consulting assignments.
7 Since our study spanned more than one calendar year, this figure is only approximate and represents information we obtained from Company X's Annual Report at the time sampling was initiated.
3.4 Response Rate And Non-Response Bias
In total, 86 questionnaires were sent out to the senior consultants. Of the 86 questionnaires, 7%
(6 questionnaires) were sent to consultants who had since left the company. Of the remaining 80 questionnaires,
we received 42 responses (including late respondents). This gives a response rate of 52.5%. Of
all the responses, 4 were unusable due to extensive missing data, leaving a total of 38 usable
responses. This sample is considered to represent 19% of the target population.
A total of 18 responses were received before the response deadline. All non-respondents
after the response deadline were contacted and reminded to fill out the questionnaire. When non-
respondents and late respondents were contacted and asked why they had not yet responded,
their primary stated reason was that they were too busy. Thus, we consider that to be the main
reason for non-response8. The characteristics of all respondents and the IS organizations that
they assessed are summarized in Figure 2.
Late respondents are considered to provide a good measure of the characteristics of non-
respondents (Armstrong and Overton 1977). To test for non-response bias, early respondents
were compared with late respondents with respect to their demographic characteristics. Given
the relatively small sample, a decision rule had to be employed in choosing the most appropriate
test (Siegel and Castellan 1988). The demographic characteristic frequencies were tabulated in
r x c tables, where c was always 2 and r was either 2 or 3. For 2x2 tables, if all expected
frequencies were greater than 5, then a Chi-Square test was used. Otherwise, the Fisher exact
test was used. For 3x2 tables, if fewer than 20% of the expected frequencies were less than 5 and
if no cell had an expected frequency of less than 1, then the Chi-Square test was used.
Otherwise, cells were combined by merging rows, and the 2x2 table decision rule was applied.
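The decision rule above can be written down as a short Python sketch. The function names are hypothetical, and the sketch only selects a test; it does not run it. The text does not say which rows were merged when a 3x2 table failed the rule, so that step is left to the caller.

```python
def expected_frequencies(table):
    """Expected cell counts under independence for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def choose_test(table):
    """Pick a test for a 2x2 or 3x2 frequency table using the rule in the text."""
    if any(len(row) != 2 for row in table):
        raise ValueError("the rule assumes exactly two columns")
    expected = [e for row in expected_frequencies(table) for e in row]
    if len(table) == 2:
        return "chi-square" if all(e > 5 for e in expected) else "fisher-exact"
    if len(table) == 3:
        share_small = sum(e < 5 for e in expected) / len(expected)
        if share_small < 0.20 and min(expected) >= 1:
            return "chi-square"
        # Which rows to merge is not specified in the text.
        return "merge-rows-then-apply-2x2-rule"
    raise ValueError("the rule is only defined for 2x2 and 3x2 tables")

# Hypothetical early/late respondent counts, for illustration only.
print(choose_test([[10, 10], [10, 10]]))
print(choose_test([[2, 3], [4, 9]]))
```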
***** Insert Figure 2 around here *****
8 Due to resource limitations, we could not pursue the reasons for non-response to a greater level of detail.
All tests of non-response bias were two-tailed and were conducted at an alpha level of 0.1.
For the characterization by location (Canada/Outside Canada), by budget (low/high), by number
of personnel (low/high), by business sector (government/non-government), and by position of
respondents (management/technical/coaching-auditing), the null hypothesis of no difference
could not be rejected. However, for the characterization by overall experience (low/high
determined around the mean), there was a significant difference. A closer examination showed
that respondents with technical positions tended to be more experienced in the early respondents
group than in the late respondents group. Hence, on demographic characteristics there seems
to be a slight bias.
Finn et al. (1983) have argued that demographic differences between respondents and non-
respondents do not automatically signal bias. To investigate this possibility, respondents and
non-respondents were compared on their four maturity scores. The Mann-Whitney U test was
used. For all four tests, the null hypothesis of no difference between the medians could not be
rejected. Hence, there seems to be no significant bias in responses between respondents and
non-respondents, even though a slight demographic difference was identified.
4 Reliability Estimates And Applications
The results of the study and their application are demonstrated in this section. It should be noted
that our instrument was intended to provide a general measure of IS organizational maturity
rather than facilitating detailed analysis of an IS organization's strengths and weaknesses, and
should therefore be interpreted and used in a congruent manner.
4.1 Reliability Estimates
The items covering each of the four organizational maturity dimensions mentioned earlier are
shown in Figure 3. For each of the four dimensions, the Cronbach alpha coefficient has been
computed. Cronbach alpha is one reliability estimate. Details of this coefficient and its
computation are given in Appendix A. According to the guidelines given by Nunnally (1978), each
of these reliability coefficients would be interpreted to be relatively high.
***** Insert Figure 3 around here *****
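For readers without Appendix A to hand, the following Python sketch shows the standard textbook formula for Cronbach's alpha: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), where k is the number of items. It is assumed, not confirmed here, that Appendix A uses the same form. The function name and the sample ratings are hypothetical and are not data from the case study.

```python
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha for one scale.

    item_scores: one row per assessed organization, each row holding the
    scores on the k items of the scale.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across organizations.
    item_vars = [statistics.variance(col) for col in zip(*item_scores)]
    # Variance of the summed scale score across organizations.
    total_var = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings of four organizations on a three-item scale.
ratings = [[2, 3, 3], [4, 4, 5], [5, 6, 6], [6, 6, 7]]
print(round(cronbach_alpha(ratings), 3))
```

Alpha rises when the items vary together across organizations, which is why it is read as a measure of internal consistency.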
Since each of the four maturity dimensions was considered separately, there is the resultant
implication that the scales are homogeneous within each dimension and that they are
heterogeneous across dimensions. Scores on these homogeneous scales can be combined
linearly to produce a more heterogeneous overall maturity scale. One possible caveat with such
linear combinations is the difficulty in interpreting the combined score (Allen and Yen 1979).
With the above caveat in mind, the four dimensions were combined linearly into a single
overall maturity score. Cronbach alpha reliability cannot be used to estimate reliability of the
combination since its calculation presumes homogeneous traits (Allen and Yen 1979). Therefore,
an alternative estimate described by Nunnally (1978) is used. The reliability estimate for this
composite is shown in the last row of Figure 3.
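The alternative estimate is not reproduced in this section. A commonly cited formula for the reliability of an unweighted sum of scales, assumed here to be the one intended, is r_yy = 1 - (sum of s_i^2 - sum of r_ii * s_i^2) / s_y^2, where s_i^2 and r_ii are the variance and reliability of each component and s_y^2 is the variance of the composite. The Python sketch below implements that assumed formula. All names and numbers are hypothetical.

```python
def composite_reliability(variances, reliabilities, composite_variance):
    """Reliability of an unweighted sum of component scales (assumed formula).

    variances: variance of each component scale
    reliabilities: reliability estimate of each component scale
    composite_variance: variance of the summed composite score
    """
    if len(variances) != len(reliabilities):
        raise ValueError("need one reliability per component")
    # Error variance contributed by each component is s_i^2 * (1 - r_ii).
    error_variance = sum(v * (1 - r) for v, r in zip(variances, reliabilities))
    return 1 - error_variance / composite_variance

# Four hypothetical standardized dimensions (variance 1, reliability 0.8).
# A composite variance of 10 implies positively correlated dimensions.
print(round(composite_reliability([1, 1, 1, 1], [0.8] * 4, 10.0), 3))
```

Under this formula the composite is more reliable than its components when the components are positively correlated, because the composite variance then grows faster than the summed error variance.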
In the remainder of this section we demonstrate two common applications of reliability
estimates. These two applications are: presenting and comparing assessment scores, and
investigating relationships. Both applications represent the kinds of decisions that are made
using maturity scores.
4.2 Presenting And Comparing Scores
Figure 4 shows the maturity scores of two different organizations on the four maturity
dimensions. In this figure, the raw scores have been transformed into z scores to facilitate score
comparisons9. The first thing to note is that the observed scores are within a band defined by
plus or minus one standard error of measurement10. This makes it clear for score interpretation
that there is an element of uncertainty associated with each observed maturity score (i.e., it is
only one of many possible scores that would be obtained had the organization been assessed
repeatedly). Such uncertainty is affected by the extent of unreliability of the assessment
procedure (the lower the reliability, the greater the standard error of measurement).
9 A z-score is a standard score. Standard scores are linear transformations of raw scores that have a predefined mean and standard deviation (see Angoff 1971). A z-score has a mean of 0 and a standard deviation of 1. This is computed by subtracting the mean from the raw score and dividing the result by the standard deviation. The unit of a z-score is the standard deviation. Converting a raw score to a z-score has the advantages that it conveys information about the relative standing of a score and makes it possible to compare scores on different maturity scales, as we do in this paper.
10 The standard error of measurement, as used in this paper, can be viewed as the standard deviation of the differences between a typical organization's true score and the observed score over a large number of repeated assessments. Also see (Allen and Yen 1979; Lord and Novick 1968; Nunnally 1978) for more details.
When comparing the scores on the four maturity dimensions for a single organization, the
standard error of measurement of the difference scores should be taken into account. For the
data from our case study, these standard errors of measurement are tabulated in Figure 5. The
extent to which the difference scores are less than the standard error of measurement reduces
the significance of the difference.
For example, the z score on the standardization dimension for organization P is -0.778, and
its score on the tools dimension is -0.301. The difference score is 0.477. This is less than the
0.4908 value tabulated in Figure 5, and hence one can conclude that the difference between the
scores on the two dimensions is not significant. Therefore, if the organization were to allocate
improvement resources to increase its maturity, it is not obvious from the assessment scores
which of these two dimensions deserves the more immediate attention.
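The comparison above can be sketched in a few lines of Python. The sketch assumes the classical test theory formulas: the standard error of measurement is sd * sqrt(1 - reliability), and for two z-scores with independent errors the standard error of their difference is sqrt(2 - r_xx - r_yy). The function names are hypothetical, and any reliabilities passed to them are illustrative rather than taken from Figure 3.

```python
import math


def sem(reliability, sd=1.0):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def se_difference(reliability_x, reliability_y):
    """Standard error of a difference between two z-scores on different scales.

    Assumes both scores are in z-score units (sd = 1) with independent
    errors, so the two squared SEMs simply add.
    """
    return math.sqrt(sem(reliability_x) ** 2 + sem(reliability_y) ** 2)


def exceeds_error(score_a, score_b, se_diff):
    """Decision rule used in the text: is |a - b| larger than the standard error?"""
    return abs(score_a - score_b) > se_diff


# z-scores for organization P quoted in the text, with the Figure 5 value.
standardization_vs_tools = exceeds_error(-0.778, -0.301, 0.4908)
```

With the quoted values the difference of 0.477 does not exceed 0.4908, which reproduces the conclusion drawn above.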
Conversely, for organization P, the difference between the standardization and project
management dimensions is relatively large (difference is 1.392) compared to the standard error
of measurement of the difference score tabulated in Figure 5 (0.4589). Therefore, it is clear that
the difference between these two scores is significant. This means that for organization P, there
is a difference between the standardization and project management dimensions that is not
simply an artifact of chance. Organization P has good reason to allocate scarce resources to
improve its score on the standardization dimension instead of on the project management
dimension.
***** Insert Figure 4 around here *****
***** Insert Figure 5 around here *****
If we wish to compare the maturity scores obtained by two different organizations, the
standard error of measurement of the difference scores should be considered. For the data from
our case study, the standard errors of measurement for differences on the same dimension are
tabulated in Figure 6.
For example, should we wish to compare scores on the organization dimension, we find that
the difference between organizations P and Q in Figure 4 (which is 0.302) is less than the
standard error of measurement as tabulated in Figure 6 (which is 0.6074). Therefore, the
difference of the scores on this dimension for the two organizations should not be considered as
significant. This means that on the organization dimension, it is not clear from the scores which
organization is better. Conversely, the difference on the tools dimension (which is 0.615) is
significant when compared to the tabulated value of 0.4992. Therefore, one can confidently
conclude that organization Q is better than organization P on the tools dimension.
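For two organizations measured on the same scale, a common classical test theory assumption is that both scores carry the same, independent error variance, which gives a standard error of the difference of sd * sqrt(2 * (1 - reliability)). The following Python sketch states that assumption explicitly and then applies the decision rule to the differences and tabulated values quoted above; the function and variable names are hypothetical.

```python
import math


def se_difference_same_scale(reliability, sd=1.0):
    """Standard error of the difference between two organizations' scores
    on one scale, assuming equal and independent error variances:
    sd * sqrt(2 * (1 - reliability))."""
    return sd * math.sqrt(2.0 * (1.0 - reliability))


# (observed difference, tabulated standard error) pairs quoted in the text.
comparisons = {
    "organization": (0.302, 0.6074),
    "tools": (0.615, 0.4992),
}

# True where the observed difference exceeds its standard error.
verdicts = {name: diff > se for name, (diff, se) in comparisons.items()}
```

The same function applies unchanged to change scores, where the two scores come from one organization assessed at two points in time.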
The comparison applications and the arguments for consideration of measurement errors can
be easily extended to the case of an organization using maturity scores to track its improvement
progress. In such a case the difference scores would be change scores.
4.3 Investigating Relationships
The second application that is considered here is concerned with ascertaining the benefits of
maturity. Achieving higher organizational maturity would be meaningless unless there were some
benefits to be gained. One possible way of gauging benefits is to look at the success of
processes and projects within the organization.
As part of our case study, we tested the hypothesis that organizational maturity (the four
dimensions) is positively associated with the success of the requirements engineering process.
The logic behind the hypothesis is that we expect that individual projects would be more
successful the higher the maturity of the organization. One could argue against the usefulness of
increased maturity if it is not associated with greater success of individual projects.
***** Insert Figure 6 around here *****
We focused on the requirements engineering process because it is considered to be one of
the more important processes in software development. For example, previous research has
shown that it costs 5 to 10 times more to fix errors during coding than during the requirements
engineering process (Boehm 1981), and that it costs from 100 to 200 times more during post-
deployment evolution (Boehm 1981; 1987). One study found a strong positive relationship
between software system errors and errors identified in requirements specifications (Davis 1989).
In addition, other authors found that the requirements engineering phase is the source of the
majority of detected software code errors (Basili and Perricone 1984; Endres 1975; Jones 1994;
Rubey et al. 1975).
During our case study, we collected data on the success of the requirements engineering
process for one project in each assessed organization. Each respondent was asked
to assess the success of the requirements engineering process for a project in which he/she has
recently been or is currently involved, and in which the requirements engineering phase has been
entered and exited at least once.
The specific instrument that was used for assessing requirements engineering success is
described in detail in (El Emam and Madhavji 1995). This instrument assesses two dimensions of
requirements engineering success: the quality of requirements engineering service (this has two
subdimensions: (a) user satisfaction and commitment, and (b) the fit of the recommended
solution with the user organization), and quality of requirements engineering products (this has
two subdimensions: (a) the quality of the architecture, and (b) the quality of the costs/benefits
analysis). The reliability estimates for both of these dimensions are shown in Figure 7, along with
their standard deviations and means.
***** Insert Figure 7 around here *****
The Pearson product moment correlation was computed between each of the 4 dimensions of
maturity and each of the 2 dimensions of requirements engineering success. The resulting
correlation coefficients are shown in Figure 8. Given that the measurement of each of the
variable pairs was not perfectly reliable, the correlation coefficients were corrected for attenuation
(see Nunnally 1978). The outcomes of this correction are also shown in Figure 8. As can be
seen, the correction did not change the magnitude of the correlation remarkably. It nevertheless
demonstrates the concept of attenuation due to unreliability.
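The standard correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities. A minimal Python sketch follows; the function name and the numeric inputs are hypothetical and are not the values behind Figure 8.

```python
import math


def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimated correlation between true scores: r_xy / sqrt(r_xx * r_yy)."""
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: an observed correlation of 0.40 between two measures
# with reliabilities 0.80 and 0.90.
corrected = correct_for_attenuation(0.40, 0.80, 0.90)
```

When both reliabilities are high, the divisor is close to 1 and the corrected coefficient differs only modestly from the observed one, which is the pattern described above.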
The results of this analysis reveal that only the organization dimension of maturity is
significantly related to the quality of service. However, given that only approximately 30% of the
variation in quality of service is explained by the organization dimension, such a relationship is
not very strong. The correlations between the overall maturity score and requirements
engineering success are also shown in Figure 8. A small but significant relationship was found
between maturity and quality of service, while no relationship was found with quality of products.
Thus, we cannot present any strong evidence showing a strong positive association between
organizational maturity and the success of the requirements engineering process.
The scaling model used for the measurement of the organizational maturity dimensions is the
summative (or "Likert", as it is also referred to) model (McIver and Carmines 1981). In this model,
the scores on all the items in a dimension are summed to obtain the total dimension score.
Galletta and Lederer (1989), when discussing the User Information Satisfaction instrument
(which also uses a summative model), consider the resultant measure to be on an ordinal
scale11. Given this interpretation, it would be more appropriate to use a nonparametric correlation
coefficient in our analysis, such as the Spearman rank correlation coefficient (Siegel and
Castellan 1988).
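To make the distinction concrete, the following Python sketch computes both coefficients. The Spearman coefficient is obtained as the Pearson correlation of the ranks, with tied values sharing their average rank. The data are hypothetical summed Likert scores, not values from our case study, and the function names are our own.

```python
import math


def pearson(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five organizations.
maturity = [12, 15, 15, 20, 31]
service = [3.0, 3.5, 4.0, 4.2, 4.4]
```

Because the rank-based coefficient uses only the ordering of the scores, it remains defensible if the summative scale is treated as ordinal.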
11 McIver and Carmines (1981), however, consider that the summative scaling model produces interval level scales. In general, and following Tukey's perspective (Tukey 1986) that if a scale is not interval, it does not necessarily have to be merely ordinal, our maturity scales are expected to occupy the grey region between ordinal and interval level measurement. We therefore present both the parametric and nonparametric measures of association. As is seen in the analysis, based on our data, our conclusions would not differ using either of the measures of association.
***** Insert Figure 8 around here *****
The Spearman rank correlation coefficients for all the hypothesized relationships are shown in
Figure 8. As can be seen, a statistically significant relationship between the organization
dimension and quality of service is indicated. Also, a weak, but statistically significant,
relationship between overall maturity and quality of service was found. No relationship between
maturity and quality of products was found. Moreover, small negative correlations between the
standardization and project management dimensions and the quality of products were found.
However, these were not significant.
We also conducted a posthoc power analysis12 of the results of the correlational analysis. The
tables in (Cohen 1977; Kraemer and Thiemann 1987) were used for the Pearson and Spearman
coefficients. An α level of 0.05 and one-tailed tests were used in this analysis. In the case of the
quality of products dimension, all tests had a power of less than approximately 15%. In the case
of the quality of service dimension, the standardization, project management and tools tests had
power of less than approximately 50%; the organization dimension tests had a power of
approximately 95%; and the overall maturity tests had a power of approximately 81% for the
Pearson correlation and approximately 60% for the Spearman correlation.
It can be seen that the tests of significance were, in general, not powerful enough to detect a
significant relationship. The generally low magnitudes of the correlation coefficients contribute to
the low power, as well as the small sample sizes. Ideally, one would strive for a power value
between 90% and 95%13.
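For readers without access to power tables, the required sample size can be approximated with the Fisher z transformation. This is a swapped-in approximation, not the table lookup used in our analysis, so its results need not match tabulated values exactly. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(r, power=0.90, alpha=0.05):
    """Approximate n needed to detect a population correlation r with a
    one-tailed test, using the Fisher z approximation:
        n = ((z_alpha + z_power) / atanh(r)) ** 2 + 3
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # one-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


n_medium = required_sample_size(0.3)
n_small = required_sample_size(0.1)
```

Under this approximation a correlation of 0.3 requires a sample of about 93, while a correlation of 0.1 requires a sample in the region of 850, which illustrates how quickly the required sample size grows as the effect size shrinks.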
The results of the above correlational analysis should be qualified by a number of
explanations. First of all, no causal relationship was implied in the hypothesis nor in the
subsequent analysis. This is especially true given that the data was collected cross-sectionally
rather than longitudinally. Second, given our limited domain of analysis and sample frame, and
the lack of internal or external replication of this study, great caution should be taken in
generalizing the lack of strong relationships between maturity and requirements engineering
success. Third, it is plausible that our measures are not highly valid. Strictly speaking, ensuring
validity is an on-going process. While we have taken great care in producing valid measures (a
summary of the evidence on validity is given in appendix C), further studies are necessary to
confirm our validity claims. Fourth, contingency variables may be moderating the relationships, a
possibility which was not considered here. However, given that the software process community
is in the formative stage of theory development within the domain of organizational maturity
research, the assumptions made here are not inappropriate. Finally, the generally low power of
the tests of significance may be the reason for the inability to reject the null hypothesis of no
relationship.
12 Statistical power is defined as the probability that a statistical test will correctly reject the null hypothesis. It is generally recommended that a power analysis be conducted especially when the results of a research study do not find significant relationships between the variables of interest (Baroudi and Orlikowski 1989).
13 It is informative to note that to attain 90% power with effect sizes so small (e.g., correlation coefficients between 0.1 and 0.3), an approximate sample size of 93 is required for r=0.3, and an approximate sample size of 864 is required for r=0.1 (from the tables in (Cohen 1977) using α=0.05 for a one-tailed test).
To summarize then, we have shown that reliability estimates can be applied in correcting the
attenuation in relationships. In our specific case study, we investigated the relationship between
maturity and requirements engineering success. Only a minor relationship was found between
the organization dimension of maturity and the quality of requirements engineering service. No
relationship was found between maturity and quality of requirements engineering products.
5 Conclusions
The objectives of this paper were to review some of the basic concepts and methods concerning
the reliability of measurement and reliability estimation, and to show how these estimates can be
used in decision making. This review should prove useful to both researchers and practitioners
intending to develop instruments for assessing organizational maturity by providing them with an
initial understanding of the concepts and pointers for further exploration. Furthermore, we hope
to have created an awareness of measurement error in assessment scores and hence have
contributed towards improving the interpretation of such scores.
We have also presented an example case study that was intended to provide practitioners
and researchers with an initial reference example for their instrument development efforts. For
our case study, we developed an instrument-based assessment method with reliabilities in the
range of 0.8-0.9, which is considered to be high. This is consistent with the reliability coefficient
reported in (Humphrey and Curtis 1991) for another assessment method.
In the case study, we demonstrated how to use reliability estimates in decision making
situations. Specifically, two decision making situations were considered. First, when comparing
assessment scores, reliability should be taken into account to determine whether the difference is
significant. Second, when investigating the relationship between maturity scores and some
criterion of effectiveness, reliability estimates can be used to correct the magnitude of the
relationship for attenuation due to unreliability.
A further finding from our case study was the lack of strong relationships between our
measures of organizational maturity and our criterion measure of effectiveness, namely the
success of the requirements engineering process. Both parametric and nonparametric
correlations were found to be generally low and many were not significantly different from zero. A
posthoc power analysis indicated that for effect sizes of such a small magnitude, much larger
sample sizes would be required in future empirical studies.
Given the rise in the use of maturity assessment methods in industry, and given the
implications of decisions made based on the results of such assessments, it would be prudent to
increase the reliability of these methods and their application. Guidelines for increasing the
reliability of assessment methods and their applications are given in Appendix D of this paper. It
would also be prudent of developers of such methods to ascertain how reliable they are and to
publish the details of their reliability studies. Where no reliability estimates are provided, it would
be prudent of users of such methods to be aware of the possibility of measurement error and to
reduce their reliance on such methods or at least their reliance on assessment scores in their
own decision making.
Acknowledgements
The authors wish to thank Jean-Normand Drouin, Dennis Goldenson, and the anonymous
referees for their valuable comments on an earlier version of this paper.
References
Allen, M. and Yen, W. 1979. Introduction to Measurement Theory. Brooks/Cole Publishing Company.
Angoff, W. 1971. Scales, norms, and equivalent scores. Educational Measurement, R. Thorndike (ed.), American Council on Education.
Armstrong, J. and Overton, T. 1977. Estimating nonresponse bias in mail surveys. Journal of Marketing Research, Vol. XIV, August, pages 396-402.
Baroudi, J. and Orlikowski, W. 1989. The problem of statistical power in MIS research. MIS Quarterly, March, pages 87-106.
Basili, V. and Perricone, B. 1984. Software errors and complexity: An empirical investigation. Communications of the ACM, 27(1):42-52.
Bell Canada 1992. TRILLIUM: Telecom Software Product Development Process Capability Assessment. Technical Report, Bell Canada.
Besselman, J., Byrnes, P., Lin, C., Paulk, M. and Puranik, R. 1993. Software Capability Evaluations: Experiences from the field. SEI Technical Review.
Boehm, B. 1981. Software Engineering Economics, Prentice Hall.
Boehm, B. 1987. Industrial software metrics top 10 list. IEEE Software, September, pages 84-85.
Bohrnstedt, G. 1970. Reliability and validity assessment in attitude measurement. Attitude Measurement, G. Summers (ed.), Rand-McNally, pages 80-99.
Bollinger, T. and McGowan, C. 1991. A critical look at Software Capability Evaluations. IEEE Software, July, pages 25-41.
Card, D. 1992. Capability evaluations rated highly variable. IEEE Software, September, pages 105-106.
Carmines, E. and Zeller, R. 1979. Reliability and Validity Assessment, Sage Publications, Beverly Hills.
Coallier, F. 1995. TRILLIUM: A model for the assessment of telecom product development & support capability. Software Process Newsletter, IEEE Computer Society, Winter, No. 2, pages 3-8.
Cohen, J. 1977. Statistical Power Analysis for the Behavioral Sciences. Academic Press.
Cohen, J. and Cohen, P. 1983. Applied Multiple Regression / Correlation Analysis for the Behavioral Sciences, Lawrence Erlbaum Associates.
Cronbach, L. 1951. Coefficient alpha and the internal consistency of tests. Psychometrika, September, pages 297-334.
Daskalantonakis, M. 1994. Achieving higher SEI levels. IEEE Software, July, pages 17-24.
Davis, J. 1989. Identification of errors in software requirements through use of automated requirements tools. Information and Software Technology, November, 31(9):472-476.
Dorling, A. 1993. SPICE: Software Process Improvement and Capability dEtermination. Information and Software Technology, June/July, 35(6/7):404-406.
Drew, D. 1992. Tailoring the Software Engineering Institute's (SEI) Capability Maturity Model (CMM) to a software sustaining engineering organization. Proceedings of the International Conference on Software Maintenance, pages 137-144.
Drouin, J-N 1994a. Software quality - An international concern. Software Process, Quality & ISO 9000, August, 3(8):1-4.
Drouin, J-N 1994b. The SPICE project: An overview. Software Process Newsletter, IEEE Computer Society, Winter, No. 2, pages 8-9.
El Emam, K. and Goldenson, D. R. 1995. SPICE: An empiricist's perspective. Proceedings of the Second International Software Engineering Standards Symposium (to appear).
El Emam, K. and Madhavji, N. H. 1995. Measuring the success of requirements engineering processes. Proceedings of the Second IEEE International Symposium on Requirements Engineering, March, pages 204-211.
Endres, A. 1975. An analysis of errors and their causes in system programs. IEEE Transactions on Software Engineering, June, 1(2):140-149.
Finn, D., Wang, C-K, and Lamb, C. 1983. An examination of the effects of sample composition bias in a mail survey. Journal of the Marketing Research Society, 25(4):331-338.
Galletta, D. and Lederer, A. 1989. Some cautions on the measurement of User Information Satisfaction. Decision Sciences, 20:419-438.
Ghiselli, E., Campbell, J., and Zedeck, S. 1981. Measurement Theory for the Behavioral Sciences, W. H. Freeman.
Guilford, J. 1954. Psychometric Methods, McGraw-Hill.
Haase, V., Messnarz, R., Koch, G., Kugler, H., and Decrinis, P. 1994. Bootstrap: Fine-tuning process assessment. IEEE Software, July, pages 25-35.
Humphrey, W. 1988. Characterizing the software process: A maturity framework. IEEE Software, March, pages 73-79.
Humphrey, W. 1989. Managing the Software Process, Addison-Wesley.
Humphrey, W. and Curtis, B. 1991. Comments on 'A Critical Look'. IEEE Software, July, pages 42-46.
Japan SC7 WG10 SPICE Committee 1994. Report of Japanese trial process assessment by SPICE method. A SPICE Project Report.
Jones, C. 1994. Assessment and Control of Software Risks, Prentice-Hall.
Kerlinger, F. 1986. Foundations of Behavioral Research. Harcourt Brace Jovanovich, Orlando, FL.
Koch, G. 1993. Process assessment: The 'BOOTSTRAP' approach. Information and Software Technology, June/July, 35(6/7):387-403.
Konrad, M. 1994. On the horizon: An international standard for software process improvement. Software Process Improvement Forum, September/October, pages 6-8.
Kraemer, H. and Thiemann, S. 1987. How Many Subjects? Statistical Power Analysis in Research, Sage Publications, Beverly Hills.
Kuvaja, P., Simila, J., Kranik, L., Bicego, A., Saukkonen, S., and Koch, G. 1994. Software Process Assessment and Improvement: The Bootstrap Approach, Blackwell.
Lord, F. and Novick, M. 1968. Statistical Theories of Mental Test Scores, Addison-Wesley.
McIver, J. and Carmines, E. 1981. Unidimensional Scaling. Sage Publications, Beverly Hills.
Nolan, R. 1973. Managing the computer resource: A stage hypothesis. Communications of the ACM, July, 16(7):399-405.
Nunnally, J. 1978. Psychometric Theory, McGraw-Hill.
Osgood, C., Suci, G., and Tannenbaum, P. 1967. The Measurement of Meaning, University of Illinois Press.
Paulk, M. and Konrad, M. 1994a. Measuring Process Capability Versus Organizational Process Maturity. Proceedings of the 4th International Conference on Software Quality.
Paulk, M. and Konrad, M. 1994b. ISO seeks to harmonize numerous global efforts in software process management. Computer, April, pages 68-70.
Paulk, M., Curtis, B., Chrissis, M-B, and Weber, C. 1993a. Capability Maturity Model, version 1.1. IEEE Software, July, pages 18-27.
Paulk, M., Curtis, B., Chrissis, M-B, and Weber, C. 1993b. Capability Maturity Model, version 1.1. Technical Report CMU/SEI-93-TR-24, Software Engineering Institute.
Rubey, R., Dana, J., and Biche, P. 1975. Quantitative aspects of software validation. IEEE Transactions on Software Engineering, June, 1(2):150-155.
Rugg, D. 1993. Using a capability evaluation to select a contractor. IEEE Software, July, pages 36-45.
Saiedian, H. and Kuzara, R. 1995. SEI Capability Maturity Model's impact on contractors. Computer, January, 28(1):16-26.
Sethi, V. and King, W. 1991. Construct measurement in information systems research: An illustration in strategic systems. Decision Sciences, 22:455-472.
Siegel, S. and Castellan, J. 1988. Nonparametric Statistics for the Social Sciences, McGraw Hill.
SEI 1994a. Software Capability Evaluation Version 2.0: Method Description. Technical Report, CMU/SEI-94-TR-06, Software Engineering Institute.
SEI 1994b. Software Capability Evaluation (SCE) Version 2.0: Implementation Guide. Technical Report, CMU/SEI-94-TR-05, Software Engineering Institute.
Subramanian, A. and Nilakanta, S. 1994. Measurement: A blueprint for theory-building in MIS. Information and Management, 26:13-20.
Thompson, K., Ince, D., Madden, P., and Angelone, E. 1992. Practical quality improvement through software process maturity. Technical Report, Institute of Software Engineering, Belfast.
Thorndike, E. 1904. An Introduction to the Theory of Mental and Social Measurements, Science Press.
Traub, R. 1994. Reliability for the Social Sciences: Theory and Applications, Sage Publications, Beverly Hills.
Tukey, J. 1986. Data analysis and behavioral science or learning to bear the quantitative man's burden by shunning badmandments. The Collected Works of John W. Tukey, Vol. III, L. Jones (ed.), pages 187-389, Wadsworth & Brooks/Cole.
Whitney, R., Nawrocki, E., Hayes, W., and Siegel, J. 1994. Interim Profile: Development and trial of a method to rapidly measure software engineering maturity status. Technical Report, CMU/SEI-94-TR-4, Software Engineering Institute.
Zubrow, D., Hayes, W., Siegel, J., and Goldenson, D.R. 1994. Maturity Questionnaire. Technical Report, CMU/SEI-94-SR-7, Software Engineering Institute.
Appendix A : Estimating Reliability
This appendix describes how to conduct reliability studies. This includes some general
considerations as well as specific reliability estimation methods.
Reliability estimation methods would usually be applied during a "pretest" study whose
objective would be to estimate the reliability of the particular assessment procedures that are
prescribed. Reliability estimates (or coefficients) are calculated from assessment scores of a
sample of organizations, not the whole population of organizations. Thus, a pretest study
provides a sample estimate of the reliability for the whole population. For different samples, it is
likely that different sample estimates would be obtained.
When conducting a pretest study, at least the following three issues should be taken into
consideration:
1 Sample Representativeness
The sample of organizations that are chosen should represent a well-defined population.
The extent to which the reliability coefficients can be generalized beyond the pretest sample
itself depends on the representativeness of the sample.
2 Identical Assessment Procedures and Conditions
The procedures and conditions under which the assessment was performed during the
pretest study should be similar to those that will exist in real applications of the prescribed
assessment procedures. Otherwise, the reliability coefficients obtained during the pretest
study may not pertain to actual applications of the assessment procedures.
3 Independence of Assessments
Assessments conducted during a pretest study should be independent of each other. This
means that the assessment of one organization should not be influenced by nor have an
influence on any other organization's assessment.
There are four basic methods for estimating reliability. All of the four methods attempt to
determine the proportion of variance in a measurement scale that is systematic. In general, with
these methods one correlates a score from a particular scale with scores from some form of a
replication of the scale. If the correlation is high, most of the variance is of the systematic type.
The different methods can be classified by the number of different assessment procedures
necessary and the number of assessment occasions necessary. This classification is depicted in
Figure 9.
Test-Retest Method
This is the simplest method for estimating reliability. In our context, one would assess each
organization's maturity at two points in time using the same assessment procedure. Reliability
would be estimated by the correlation between the scores obtained on the two assessments.
This method has the advantage of requiring only one form of the assessment procedure.
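A minimal Python sketch of the test-retest estimate follows. It assumes that each organization has one maturity score per occasion; the scores and variable names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical maturity scores for six organizations, each assessed twice
# with the same assessment procedure.
first_occasion = [2.1, 3.4, 1.8, 4.0, 2.9, 3.6]
second_occasion = [2.3, 3.1, 2.0, 3.8, 3.0, 3.9]

# The test-retest reliability estimate is the correlation across occasions.
stability = pearson(first_occasion, second_occasion)
```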
The primary disadvantages of using this method are threefold. First, it is often expensive and
even impractical to conduct maturity assessments at more than one point in time. Prior
experience has identified the costs of assessments as a concern (Besselman et al. 1993; Japan
SC7 WG10 SPICE Committee 1994). Second, it is not obvious that a low reliability coefficient
obtained using this method indicates low reliability. Another possible reason for a low reliability
coefficient is that the organization's maturity has changed between the two assessments. For
example, the initial assessment results could sensitize an organization to specific weaknesses
which may prompt them to initiate an improvement effort that influences the result of subsequent
assessments. This would generally lead to an underestimate of reliability. Third, carry-over
effects between assessments may lead to an overestimate of reliability. For example, the
reliability coefficient can be artificially inflated due to memory effects like the assessees knowing
the 'right' answers that they have learned from previous assessments and assessors
remembering responses from previous assessments and, deliberately or otherwise, repeating
them in an attempt to maintain consistency of results.
Estimating reliability using the test-retest method is troublesome because it is not easy to
determine an appropriate time interval between assessments. If one increases the interval to
minimize carry-over effects, then one is also increasing the likelihood that the true organizational
maturity has changed. Due to its heavy focus on stability over time, the test-retest reliability
coefficient is also sometimes referred to as the stability coefficient.
***** Insert Figure 9 around here *****
Alternative Forms Method
Instead of using the same assessment procedure on two occasions, the alternative forms
method stipulates that two alternative assessment procedures be used. This can be achieved, for
example, by using two different maturity questionnaires or having two alternative, but equally
qualified, assessors (or assessment teams) for the two occasions.
This method can be characterized either as immediate (where the two occasions are
concurrent in time), or delayed (where the two occasions are separated in time). The correlation
coefficient is then used as an estimate of reliability of either one of the alternate forms. If
assessments are made on more than two occasions, the usual practice is to take the average of
the inter-correlations as an estimate of reliability. Interpreting the correlation coefficient in such a
manner is only applicable if the obtained scores satisfy the criteria for parallel tests14 or are linear
functions of parallel test scores.
In the delayed alternative forms method, the disadvantages are similar to those for the test-
retest method. Furthermore, for both (immediate and delayed) approaches, there is the practical
difficulty, and hence the possibility, that the alternative forms are not measuring organizational
maturity to the same degree. This would lead to an underestimate of reliability. In the case where
different assessors are used in the alternative forms, any discussions amongst them about the
assessed organization and its status immediately prior to or during the assessment would likely
contaminate the reliability estimates (i.e., the assessments would no longer be independent).
Since there is heavy reliance on the degree to which alternative forms are really equivalent,
the reliability estimate obtained using this method is referred to as the equivalent-forms
coefficient.
14 According to the concept of parallel tests, we can ascertain the extent of measurement error if we can obtain a series of measures for a particular construct (e.g., organizational maturity). A series of k measures is considered parallel if the true score is the same for all k measurements of the construct, and if the error variance of each of the k measurements is equal. The concept of parallel tests thus indicates that each of the k measures should be as good a measure of the construct as any of the other k-1 measures. It is not necessary that the same measurement operation be performed for all the k parallel measurements, but only that all the measurement operations must be measuring the same construct to the same degree.
Split-Halves Method
With the split-halves method, the items in an assessment instrument are divided
into two halves and the half-instruments are correlated to get an estimate of reliability. The
halves can be considered as approximations to alternative forms. A correction must be applied to
the correlation since that correlation gives the reliability of each half. One correction is known as
the Spearman-Brown prophecy formula: 2r/(1 + r), where r is the correlation. The Spearman-
Brown formula should only be used when the halves can be considered as parallel tests.
Otherwise, the Cronbach alpha coefficient (described later) should be used. Another formula
developed by Rulon (Traub 1994) can be used to estimate the reliability of the whole instrument
and does not require assumptions about the halves being parallel.
This method has the advantage that equivalent forms of an assessment procedure are not
necessary and there is no need for assessments on multiple occasions. A problem with this
method, however, is deciding how to divide an instrument in two parts. The procedure generally
used is to take even numbered items on an instrument as one part and the odd numbered ones
as the second part. This approach is generally preferred since, assuming the sequence of
responses obtained follows the sequence in the instrument, it controls for any systematic factors
operating during the assessment period that may influence responses from earlier on in the
assessment to later on in the assessment.
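The odd/even split and the two corrections mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the case study: the response data are hypothetical, the function names are our own, and the Rulon estimate is written in its usual form, one minus the ratio of the variance of the half-score differences to the variance of the total scores.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def split_half_scores(responses):
    """Sum odd-numbered and even-numbered items for each respondent."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return odd, even

def spearman_brown(r):
    """Step the half-instrument correlation r up to full length: 2r/(1 + r)."""
    return 2 * r / (1 + r)

def rulon(odd, even):
    """Rulon estimate: 1 - var(differences) / var(totals)."""
    diff = [a - b for a, b in zip(odd, even)]
    total = [a + b for a, b in zip(odd, even)]
    return 1 - variance(diff) / variance(total)

# Hypothetical data: 6 respondents, 4 items on a 7-point scale.
responses = [
    [5, 6, 5, 6],
    [3, 2, 3, 3],
    [7, 6, 6, 7],
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [6, 5, 6, 6],
]
odd, even = split_half_scores(responses)
r_half = pearson(odd, even)
print("half correlation:", r_half)
print("Spearman-Brown  :", spearman_brown(r_half))
print("Rulon           :", rulon(odd, even))
```

When the two halves have equal variances the two corrections give the same value; they diverge as the halves depart from being parallel.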
Internal Consistency Method
With this method one examines the covariance amongst all the items in an assessment
instrument simultaneously. Two estimates of internal consistency are Guttman's L2 (Traub 1994)
and Cronbach's alpha (Cronbach 1951).
In another related scientific discipline, namely Management Information Systems (MIS),
researchers tend to report the Cronbach alpha coefficient most frequently (Subramanian and
Nilakanta 1994). Also, it is considered by some researchers to be the most important reliability
estimation approach (Sethi and King 1991). Thus, due to its frequent use and perceived
importance, the logic of computing the Cronbach alpha coefficient (from a covariance matrix) will
be described in more detail below.
The type of scale used in the most common maturity assessment procedures is a summative
one. This means that the individual scores for each question are summed up to produce an
overall score. One property of the covariance matrix for a summative scale that is important for
the following formulation is that the sum of all the elements in the matrix gives exactly the variance
of the scale as a whole.
One can think of the variability in a set of item scores as being due to one of two things: (a)
actual variation across the organizations in maturity (i.e., true variation in the construct being
measured) and this can be considered as the signal component of the variance, and (b) error
which can be considered as the noise component of the variance. Computing the Cronbach
alpha coefficient involves partitioning the total variance into signal and noise. The proportion of
total variation that is signal equals alpha.
The signal component of variance is considered to be attributable to a common source,
presumably the true score of the construct underlying the items. When maturity varies across the
different organizations, scores on all the items will vary with it because it is a cause of these
scores. The error terms are the source of unique variation that each item possesses. Whereas all
items share variability due to maturity, no two items share any variation from the same error
source (this is an assumption of the classical theory presented earlier).
Unique variation is the sum of the elements in the diagonal of the covariance matrix: $\sum s_i^2$.
Common variation is the difference between total variation and unique variation: $s_y^2 - \sum s_i^2$, where
the first term is the variation of the whole scale. Therefore, the proportion of common variance
can be expressed as: $(s_y^2 - \sum s_i^2)/s_y^2$. To express this in relative terms, the number of elements
in the matrix must be considered. The total number of elements is $k^2$, and the total number of
elements that are communal is $k^2 - k$. Thus the corrected equation for coefficient alpha
becomes: $\frac{k}{k-1}\left[(s_y^2 - \sum s_i^2)/s_y^2\right]$.
***** Insert Figure 10 around here *****
To give a concrete example, we consider the covariance matrix for one dimension of maturity
described in the body of this paper, namely standardization. Using the data from our case study
application, the covariance matrix of the standardization dimension is shown in Figure 10. The
M's in the table correspond to the items in Figure 3. Using the equation given above, the value of
Cronbach alpha was computed to be: 1.25 [(53.95 - (4.32 + 3.86 + 2.62 + 2.55 + 2.46))/53.95] =
0.8837. This is the value given in Figure 3.
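The arithmetic above can be reproduced with a short Python sketch. The five item variances and the scale variance are the figures quoted in the text; the function names are our own, and the small two-item covariance matrix at the end is hypothetical, included only to show the same computation starting from a full matrix.

```python
def cronbach_alpha(cov):
    """Coefficient alpha from a k x k covariance matrix (list of lists)."""
    k = len(cov)
    total_variance = sum(sum(row) for row in cov)        # variance of the summed scale
    unique_variance = sum(cov[i][i] for i in range(k))   # sum of the diagonal
    return (k / (k - 1)) * (total_variance - unique_variance) / total_variance

def cronbach_alpha_from_summary(item_variances, total_variance):
    """Same formula when only the diagonal and the scale variance are known."""
    k = len(item_variances)
    return (k / (k - 1)) * (total_variance - sum(item_variances)) / total_variance

# Standardization dimension: item variances for M1..M5 and the scale variance.
alpha = cronbach_alpha_from_summary([4.32, 3.86, 2.62, 2.55, 2.46], 53.95)
print(round(alpha, 4))  # 0.8837

# Hypothetical two-item covariance matrix, for illustration only.
print(cronbach_alpha([[1.0, 0.5], [0.5, 1.0]]))
```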
Appendix B: Part of the Organizational Maturity Instrument
Instructions
The purpose of this instrument is to measure the overall maturity of an Information Systems department. All the questions concern the particular Information Systems department that you have been involved with recently. Pilot applications of this instrument indicate that it takes approximately 10-20 minutes to complete. Please answer all the questions.
You will find in this questionnaire some overall characteristics of the Information Systems department. These characteristics are considered important with respect to the department's maturity. Beneath each characteristic you will find a scale. You are to rate the characteristics on each of these scales in order.
Here is how you are to use those scales:
If you feel that the concept is very closely characterized by one end of the scale, you should check-mark as follows:
sufficient |__|__|__|__|__|__|X_| insufficient
sufficient |X_|__|__|__|__|__|__| insufficient
If you feel that the concept is quite closely characterized by one or the other end of the scale (but not extremely), you should place your check-mark as follows:
sufficient |__|__|__|__|__|X_|__| insufficient
sufficient |__|X_|__|__|__|__|__| insufficient
If the concept seems only slightly characterized by one side of the scale as opposed to the other side (but is not really neutral), then you should check-mark as follows:
sufficient |__|__|__|__|X_|__|__| insufficient
sufficient |__|__|X_|__|__|__|__| insufficient
If you consider the concept to be characterized as neutral on the scale, or both sides of the scale equally characterize the concept, then you should place your check-mark in the middle space:
sufficient |__|__|__|X_|__|__|__| insufficient
Thank you for your time spent completing this questionnaire.
1. The delivery approach and methodology:
none or varies with individual |__|__|__|__|__|__|__| defined and used by all
2. The modeling standards:
no standards |__|__|__|__|__|__|__| strict standards based on architecture
3. The systems documentation:
none or very inconsistent |__|__|__|__|__|__|__| available, current and clear
4. User requirements standards:
none |__|__|__|__|__|__|__| well established and implemented
5. Development standards:
none |__|__|__|__|__|__|__| well established and implemented
6. Systems cost/benefits analysis are produced and monitored:
frequently not done |__|__|__|__|__|__|__| formal justification and monitoring
7. The delivery of systems:
consistently late |__|__|__|__|__|__|__| consistently on time
8. A project manager is clearly identified for every project:
seldom |__|__|__|__|__|__|__| always
9. The project plan is produced, updated and communicated:
seldom |__|__|__|__|__|__|__| frequent
10. Project reviews:
none or undefined |__|__|__|__|__|__|__| defined and implemented
11. Systems implementation and deployment:
improvised |__|__|__|__|__|__|__| smooth and well planned
12. The use of project management tools:
none |__|__|__|__|__|__|__| pervasive
13. The integration of methodology and techniques with tools:
no integration |__|__|__|__|__|__|__| effectively reenforce each other
14. The availability of tools:
none |__|__|__|__|__|__|__| yes to everybody
15. Support for tools:
inadequate support |__|__|__|__|__|__|__| adequate support
16. The overall organization's strategy, missions, goals, tactics and priorities:
Undefined |__|__|__|__|__|__|__| clearly documented and understood
17. The IS organization's strategy, goals, and priorities:
none |__|__|__|__|__|__|__| clearly documented and understood
18. The alignment between business strategy and system projects:
weak |__|__|__|__|__|__|__| strong
19. The IS organization's ability to absorb and implement innovations:
inferior |__|__|__|__|__|__|__| superior
20. The budgeting process of the IS organization:
integrated with organizational priorities |__|__|__|__|__|__|__| not integrated with organizational priorities
Appendix C: Evidence of Validity
In this appendix we present some further evidence as to the validity of the maturity scales that
we have developed. Three sets of analyses were performed following the recommendations of
Bohrnstedt (1970), Kerlinger (1986), and Nunnally (1978).
Average Interitem Correlations
The items within a dimension should correlate higher with each other than they do with items in
other dimensions (Bohrnstedt 1970). Therefore, we computed the average interitem correlations
within each dimension and compared them with the average correlation of these same items with
items in the other three dimensions. In all comparisons performed, it was found that the
within-dimension average correlation was higher than the average correlation with other dimensions.
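The comparison described above can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the response matrix and the assignment of item columns to two dimensions are hypothetical, and the helper names are our own.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def column(responses, j):
    """Scores of all respondents on item j."""
    return [row[j] for row in responses]

def mean_within(responses, items):
    """Average correlation over all pairs of items inside one dimension."""
    return mean(pearson(column(responses, i), column(responses, j))
                for i, j in combinations(items, 2))

def mean_between(responses, items, other_items):
    """Average correlation of a dimension's items with items outside it."""
    return mean(pearson(column(responses, i), column(responses, j))
                for i in items for j in other_items)

# Hypothetical data: columns 0-1 belong to dimension A, columns 2-3 to dimension B.
responses = [
    [1, 1, 2, 2],
    [2, 2, 1, 1],
    [3, 3, 1, 1],
    [4, 4, 2, 2],
]
dim_a, dim_b = [0, 1], [2, 3]
print(mean_within(responses, dim_a), mean_between(responses, dim_a, dim_b))
```

The evidence for validity is the inequality itself: for each dimension, the within-dimension average should exceed the average with items from the other dimensions.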
Item-total Correlations
The items within a dimension are expected to correlate highly with the total score for that
dimension. Such correlations provide further evidence of validity (Kerlinger 1986; Nunnally 1978).
In calculating such item-total correlations, each item score was subtracted from the total to avoid
a spurious part-whole correlation (Cohen and Cohen 1983), and the correlation of each item with
the new total score was computed. These results are shown in Figure 11. As can be seen, all
correlations are relatively high, and all are significant (one-tailed test) at α = 0.005.
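The correction for the part-whole overlap amounts to correlating each item with the sum of the remaining items. A minimal Python sketch, with hypothetical response data and function names of our own choosing:

```python
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def corrected_item_total(responses):
    """Correlate each item with the dimension total minus that item."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    correlations = []
    for j in range(k):
        item = [row[j] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]  # total with item j removed
        correlations.append(pearson(item, rest))
    return correlations

# Hypothetical data: 3 respondents, 3 items in one dimension.
responses = [
    [1, 2, 1],
    [2, 1, 2],
    [3, 3, 3],
]
print(corrected_item_total(responses))
```

Without the subtraction, each item would be correlated with a total that contains it, which inflates the coefficient, most noticeably for dimensions with few items.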
Factor Analysis
Factor analysis is considered to be (Kerlinger 1986) "a powerful and indispensable method of
construct validation." The results of this analysis for the four dimensions are shown in Figure 11.
As can be seen, the items that are expected to tap one dimension load highly on one factor, and
have low loadings on the other factors (all missing coefficients are less than 0.5).
***** Insert Figure 11 around here *****
The results of the above three sets of analyses, as well as the method used to develop the
maturity instrument, increase our confidence in the validity of the maturity scales. However,
demonstrating validity is an on-going process, and it is through the accumulation of evidence
from multiple studies that we can start to make strong claims of validity.
Appendix D: Guidelines for Increasing Reliability
Below, we present some development and use guidelines for maturity assessments. These
guidelines identify issues that should be taken into consideration by those who develop
assessment procedures, and by those who use them. These guidelines describe ways of
increasing the reliability of assessment scores:
1 Standardize Assessment Procedures
The procedures used for an assessment must be standardized, and individual assessments
must follow them closely to ensure consistency. In the case of assessment
instruments, instructions concerning the purpose and how to determine responses and judge
scores should be provided. In the case of interviews, the conduct of the interviews (e.g.,
assurance of confidentiality and the type and scope of evidence inspected) should be
defined.
2 Training of Assessors
Assessors should be trained in the assessment procedure and should have extensive
experience with software development and maintenance. Furthermore, there should be a
consistency in the qualifications of the assessors following a particular assessment
procedure.
3 Increasing the Length of the Assessment Instrument
Reliability estimates utilize the assessment scores. The more questions asked about the
maturity of an organization, the higher the reliability estimates are likely to be. Of
course, if the added questions have nothing to do with maturity, then increasing the length of
the instrument may reduce reliability. However, it is assumed that added questions are
chosen as carefully as the original questions and that they will not reduce the average inter-
item correlations.
4 Sampling of Projects
In assessment procedures where a sample of an organization's projects is assessed, and
the results are used as an indicator of overall organizational maturity, specific sampling criteria
should be specified. These sampling criteria should be applied consistently in all
assessments claiming to follow a particular assessment procedure.
5 Using Multiple Point Scales
Determining the number of points on a scale involves a tradeoff between losing some of the
discriminative powers which the assessors are capable of (with too few points) and having a
scale that is too fine and hence beyond the assessors' powers of discrimination (with too
many points). In general, it has been found that there is an increase in reliability as the
number of points increases from 2 to 7, after which it tends to level off (Nunnally 1978;
Guilford 1954).
6 Having a Validation Cycle
Such a cycle involves validating the information that the assessors have initially gathered.
This may involve corroborative interviews and (further) inspections of documents. This would
seem to be more important where the assessors are external to the assessed organization
and when there is a danger of misrepresentation by the assessees.
7 Using the Same Assessors
Where no estimates of reliability are available or no reliability studies have been performed
for a particular procedure, it is safer to have the same assessors perform assessments on
different occasions and/or for different organizations. For example, in progress self-
assessments where an organization wants to determine whether maturity has increased due
to some improvement efforts, it is preferable that the same assessors be used. Also, if one
were to rank n organizations based on the assessment results, reliability would be increased
if the same assessors conducted all the assessments.
Following the above guidelines would be considered as good practice for increasing the
reliability of assessment procedures. Of course, not all the guidelines are applicable for all
assessment procedures. The context of using the assessment procedure should be taken into
consideration.
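The effect of instrument length described in guideline 3 can be illustrated with the general form of the Spearman-Brown prophecy formula, n r / (1 + (n - 1) r), where r is the current reliability and n is the factor by which the instrument is lengthened. The Python sketch below assumes, as the guideline does, that the added questions are comparable to the original ones; the function names are our own, and the starting value 0.8837 is the Cronbach alpha of the standardization dimension, used here only as an example input.

```python
def prophecy(r, n):
    """Predicted reliability when an instrument is lengthened by a factor n.

    Assumes the added items behave like the existing ones (parallel items).
    """
    return n * r / (1 + (n - 1) * r)

def required_length_factor(r, target):
    """Lengthening factor needed to move reliability from r to target."""
    return target * (1 - r) / (r * (1 - target))

current = 0.8837  # example input: alpha of the standardization dimension
for n in (0.5, 1, 2, 3):
    print(n, prophecy(current, n))

print(required_length_factor(current, 0.95))
```

With n = 2 the formula reduces to 2r / (1 + r), the split-halves correction; values of n below 1 show the reliability expected from a shortened instrument.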
Figure 1: Path diagram depicting the organizational maturity construct and example items for its assessment.
Characteristic                          Value                                     Percentage/Average
Location of Organizations               Canada                                    58%
                                        U.S.A.                                    31.6%
                                        Australia                                 10.5%
Position of Respondents                 Management                                39.5%
                                        Technical                                 31.6%
                                        Coaching/Auditing                         28.9%
Years of Experience of Respondents      Management                                11.47 Yrs.
                                        Technical                                 11.17 Yrs.
                                        Coaching/Auditing                         14.18 Yrs.
Main Business of Organization           Government/Public Admin./Military         36.8%
                                        Retail, Distribution and Transportation   18.4%
                                        Aerospace                                 15.8%
                                        Financial/Insurance/Real Estate           7.9%
                                        Manufacturing (other than aerospace)      7.9%
                                        Other                                     13.1%
Budget or Total Revenue of Organization <= CA$99m                                 10.5%
                                        CA$100m - CA$149m                         5.3%
                                        CA$150m - CA$199m                         2.6%
                                        CA$200m - CA$249m                         7.9%
                                        CA$250m - CA$999m                         18.4%
                                        >= CA$1 billion                           55%
Figure 2: Characteristics of respondents and assessed organizations.
Figure 3: Characteristics of four dimensions of organizational maturity and overall maturity.
Columns: Dimension Name; Items; Cronbach Alpha / Composite; Standard Deviation; Mean

Standardization (Cronbach Alpha / Composite: 0.8837; Standard Deviation: 7.3444; Mean: 17.7105)
• The extent to which the delivery approach and methodology are defined and used by all in the IS department (M1)
• The extent to which strict modeling standards based on the architecture are defined (M2)
• The extent to which systems documentation is available, current and clear (M3)
• The extent to which user requirements standards are established and implemented (M4)
• The extent to which development standards are established and implemented (M5)

Project Management (Cronbach Alpha / Composite: 0.9056; Standard Deviation: 9.4299; Mean: 21.8592)
• The extent to which systems costs/benefits analysis are formally produced and monitored (M6)
• The extent to which the delivery of systems is consistently on time (M7)
• The extent to which a project manager is clearly identified for every project (M8)
• The extent to which a project plan is produced, updated and communicated on every project (M9)
• The extent to which project reviews were defined and implemented (M10)
• The extent to which system implementation and deployment is smooth and well planned (M11)

Tools (Cronbach Alpha / Composite: 0.8755; Standard Deviation: 6.5075; Mean: 13.9559)
• The extent of use of project management tools (M12)
• The extent to which methodology and techniques are integrated with tools (M13)
• The extent to which tools are available to everyone in the IS department (M14)
• The adequacy of the support provided for tools (M15)

Organization (Cronbach Alpha / Composite: 0.8155; Standard Deviation: 6.6178; Mean: 21.9941)
• The extent to which the overall organization's strategy, missions, goals, tactics and priorities are clearly documented and understood (M16)
• The extent to which the IS organization's strategy, goals, and priorities are clearly documented and understood (M17)
• The strength of the alignment between the business strategy and systems projects (M18)
• The IS organization's ability to absorb and implement innovations (M19)
• The extent to which the budgeting process of the IS organization is integrated with organizational priorities (M20)

Overall Maturity (Cronbach Alpha / Composite: 0.9486; Standard Deviation: 23.3464; Mean: 75.5197)
• Standardization
• Project Management
• Tools
• Organization
[Chart "Maturity Profile": Org. P and Org. Q plotted on Standardization, Project Management, Tools, and Organization; vertical axis from -2.5 to 2.5.]
Figure 4: The profiles of two organizations on the four dimensions of maturity.
Figure 5: The standard error of measurement for difference scores between dimensions.
Figure 6: The standard error of measurement for difference scores within dimensions.
Figure 7: Reliability, standard deviations, and means of the two requirements engineering success dimensions.
Figure 8: Correlations (Pearson and Spearman) and corrected correlations between maturity and requirements engineering success.
Figure 9: A classification of reliability estimation methods.
Figure 10: Covariance matrix of the first dimension of maturity: standardization.
Dimension            Item   Factor loading (Factors 1-4)   Item-total Correlation
Standardization      M1     0.6567                         0.6510
                     M2     0.8585                         0.8372
                     M3     0.6503                         0.6475
                     M4     0.7788                         0.6717
                     M5     0.7863                         0.8437
Project Management   M6     0.7344                         0.5927
                     M7     0.7192                         0.6744
                     M8     0.6722                         0.7404
                     M9     0.6352                         0.7295
                     M10    0.8758                         0.8910
                     M11    0.8272                         0.8477
Tools                M12    0.7311                         0.6023
                     M13    0.7355                         0.7984
                     M14    0.9022                         0.8585
                     M15    0.8151                         0.6943
Organization         M16    0.8315                         0.7873
                     M17    0.6986                         0.7218
                     M18    0.5694                         0.5139
                     M19    0.7216                         0.6137
                     M20    0.6956                         0.4138
Figure 11: Factor analysis results and item-total correlations for the four dimensions of maturity.