
Course Manual: PG Certificate in Statistics, TCD, Revised Draft, 18/1/16 Design and Analysis of Experiments Module © 2016 Michael Stuart

Postgraduate Certificate in Statistics Design and Analysis of Experiments

Introduction

1. What is an experiment?
   Experiment as demonstration
   Thought experiments
      A statistical thought experiment
   Simulation experiments
   Comparative experiments
   Control, a key feature of comparative experiments
2. Design issues in experimentation; a manufacturing case study
   Statistical assessment of a process change
   Confounding factors
   Attaining homogeneous experimental conditions
   Replication
   Randomization
      Illustration
3. How randomization works, a clinical trial case study
   Randomize to balance the effects of unknown covariates
   Randomisation to minimise bias
   Random versus systematic assignment
4. Multi-factor experiments
   Traditional vs. statistical design
      Efficiency
      Interaction
   Several levels
   Several factors
5. Experimental vs. Observational Studies
6. Strategies for Experimentation
Appendix 1.1.1 Calculation of the control limits for Fig 1.1.1

Page 2: Postgraduate Certificate in Statistics Design and Analysis of Experiments … Notes... · Comparative experiments will be the focus of attention in this module. Control, a key feature

2

1. What is an experiment?

Experiment as demonstration

Schoolroom and undergraduate laboratory "experiments" tend to be concerned with demonstrating the application of scientific theories and also demonstrating laboratory methods and techniques. Consider, for example, measuring the period of a pendulum, as illustrated in the diagram at http://en.wikipedia.org/wiki/Pendulum, from which this example is taken. (The originator released this work to the public domain for any use whatever.)

The pendulum's period is the time it takes to complete a cycle, from being released at one end of its trajectory to the other end and back again. With an ideal pendulum, the period, T, depends

on its length, L, and, to a small degree, its amplitude. For small amplitudes, the latter dependence is minimal and T is given, to a high degree of approximation, by the formula

$$T = 2\pi\sqrt{L/g},$$

where g is the acceleration due to gravity. This provides a simple way of calculating g: simply set up a pendulum, measure its length, let it swing through a complete cycle, measure the time taken to complete the cycle and calculate

$$g = \frac{4\pi^2 L}{T^2}.$$
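As a quick check of the arithmetic, the following Python sketch computes g from one hypothetical pair of measurements; the values of L and T below are invented for illustration and are not from the module.

```python
import math

# Hypothetical measurements (illustrative values only):
L = 0.994   # pendulum length in metres
T = 2.000   # measured period in seconds

# g = 4 * pi^2 * L / T^2, from the formula above
g = 4 * math.pi ** 2 * L / T ** 2
print(f"g = {g:.2f} m/s^2")   # about 9.81 m/s^2 for these values
```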

This is a very simple example of an "experiment" as demonstration. A more profound example is due to the English scientist Isaac Newton. He observed the "scattering" of white light into its component colours when passed through a prism and used this observation as a basis for formulating a theory of colour. The theory prevailing at the time was that light was pure and the breakup into colours by a prism was due to impurities in the prism. He demonstrated the falsity of this theory by shining light through a prism onto a screen, producing a "rainbow" of colours on the screen. He made a gap in the screen corresponding to one of the colours, allowing it to pass through. He then allowed it to pass through a second identical prism onto a second screen; the colour was unchanged. Such demonstration experiments are clearly important. However, they are not the subject of this module.


Thought experiments

The Italian scientist Galileo Galilei disposed of another erroneous theory by way of a thought experiment, without the need to conduct an actual experiment. The Greek philosopher and scientist Aristotle believed that heavy objects falling through space travel faster than light objects; according to his theory, the speed of falling objects is proportional to their weight. Galileo disproved Aristotle's theory by imagining two objects, one heavier than the other, tied together and dropped from a height. According to the Aristotelian theory, the smaller object will tend to fall more slowly than the larger and, because the objects are tied together, the smaller object will tend to slow the larger one, so that both objects together will fall more slowly than the larger one would on its own. But since the two objects tied together form a single object that is larger than either one, the combined pair of objects should fall faster than the larger one would on its own, according to the same Aristotelian theory. This contradiction disproves the Aristotelian theory, which had held sway for sixteen centuries.

A statistical thought experiment

Statistical inference relies heavily on the notion of a sampling distribution. The sampling distribution of the mean was introduced in Base Notes §1.5, pp. 30-31, and used extensively subsequently. The key idea is that an average value calculated from a sample of values sampled from a single population is subject to less variation than individual values sampled from the population. This is what makes statistical inference about the population mean work. As Base Notes §1.5, p. 32, puts it, "this gain in precision is the reason why we base decisions on means in practice".

To understand why this is so, imagine repeatedly randomly sampling several values from a population (the same number each time) and each time calculating the mean value. As the repetitions progress, build up a frequency distribution of both the randomly sampled values and the mean values. The frequency distribution of sample values will more and more resemble the population frequency distribution. However, as each mean value falls towards the middle of the corresponding sample values, the mean values are subject to less variation than the individual values. Consequently, the frequency distribution of repeatedly sampled mean values will have less spread than the frequency distribution of the individual values. With indefinitely repeated sampling, the latter frequency distribution effectively becomes the same as the population frequency distribution, while the former becomes the sampling distribution of the mean.

The ability to think one's way through "experiments" such as this is an important indicator of a capability for statistical thinking, a fundamental requirement for successful statistical analysis1.

1 Two masters of the art of statistics, Frederick Mosteller and John W. Tukey, in their book Data Analysis and Regression, A Second Course in Statistics, Pearson, 1977, put it as follows: "One hallmark of the statistically conscious investigator is his firm belief that however the survey, experiment, or observational program actually turned out, it could have turned out somewhat differently. Holding such a belief and taking appropriate actions make effective use of data possible." In the 1990's, the notion of "statistical thinking" began to become formalised and the centrality of its role emphasised. One version advocates a view of statistical thinking that involves a focus on processes, the recognition that all processes are subject to variation and that identifying, characterising, quantifying and reducing process variation are keys to successful problem solving and process improvement.


Simulation experiments

The statistical thought experiment described above provides an intuitive introduction to the concept of a sampling distribution and arrives at the conclusion that the sampling distribution of the mean has less spread than the sampled population. The repeated sampling idea on which it is based also provides the basis for simulating sampling distributions, using computer simulation software. One such experiment is described in Base Notes §1.5, pp. 32-36, in which it is demonstrated that the sampling distribution of the sample mean may be approximately Normal even in cases where the sampled population frequency distribution is not Normal. As another example, a slight modification of that simulation could be devised to demonstrate that the standard error of the mean, that is, the standard deviation of the sampling distribution of the mean, approximates the population standard deviation divided by the square root of the sample size. This may be pedagogically helpful for students who may be deterred by the mathematical derivation of the standard formula

$$SE(\bar{X}) = \sigma/\sqrt{n}.$$

Such a simulation experiment may be easily implemented in software such as Minitab, as may be verified by undertaking the following exercise2.
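The notes use Minitab for such simulations; as a rough equivalent, here is a minimal sketch in Python (the exponential population, the sample size and the number of repetitions are arbitrary illustrative choices, not taken from the Base Notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Repeatedly draw samples of size n from a deliberately non-Normal
# (exponential) population and record the mean of each sample.
sigma, n, reps = 10.0, 25, 10_000        # exponential: sd = mean = 10
samples = rng.exponential(scale=sigma, size=(reps, n))
means = samples.mean(axis=1)

print("sd of individual values:", round(float(samples.std(ddof=1)), 2))  # ~10
print("sd of sample means:     ", round(float(means.std(ddof=1)), 2))    # ~2
print("sigma / sqrt(n):        ", sigma / n ** 0.5)                      # 2.0
```

A histogram of the means would also show the approximate Normality referred to above.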

Exercise ???

Comparative experiments

A curious student, having completed the experiment/demonstration with a pendulum as described above, may wonder what would happen if the amplitude of the pendulum was changed, or if the length of the string was changed, or if the weight of the "bob" was changed. If the formula used to calculate g is to be believed, changing the length will change the period but changing either of the other two will have no effect. A sceptical student may want to check this. The obvious approach is to make changes to the amplitude, the length and the weight, run the experiment described above after making each change, observe the results and compare results with and without the changes. This is precisely what the aforementioned Galileo Galilei did at the beginning of the 17th century when he explored the properties of the pendulum. He found that changing the weight had no effect on the period and, provided the amplitude was not too big, changes in amplitude also had no noticeable effect on the period. However, he found that increasing the length had the effect of increasing the period. The fact that the period did not change appreciably over a long time, even though the amplitude was gradually decreasing, meant that the pendulum could be used as a timekeeper. The fact that the period changed with length meant that the length could be adjusted to make the period correspond to a desired unit of time, such as one second. These investigations provided the basis for much more accurate timekeeping than had been possible up to then.

2 Further illustrations in a practical setting and some background discussion of simulation may be found in Mullins, E. and Stuart, M. (1992), Simulation as an Aid in Practical Statistical Problem-Solving, Journal of the Royal Statistical Society, Series D (The Statistician), Vol. 41, No. 1, pp. 17-26.


These experiments conducted by Galileo3 are examples of comparative experiments. A comparative experiment is a programme of actions undertaken to study the effects of making changes to a process or system. Such an experiment is conducted by measuring the output of the process without and with the changes and noting the differences. As George Box put it,

To find out what happens when you change something it is necessary to change it. (BHH, p. 404)

Comparative experiments will be the focus of attention in this module.

Control, a key feature of comparative experiments

For comparative experiments to be successful, a key requirement is a suitable degree of control of the study environment. How much control can be exercised depends on the nature of the study environment. There is considerable debate, if not controversy, surrounding this topic. At one extreme, some philosophers argue that true experiments require a comparison of the system's behaviour with and without the change whose effect, if any, is being studied. Since both situations cannot be observed simultaneously, the purist will argue that there is no such thing as a true experiment in which the only possible explanation for a difference in observations is the change under study. If the system may be observed with and without the change at two different times, then something else may have changed between those times that may have influenced the system. This is referred to by philosophers as the counterfactual argument; if the system is observed with the change in effect, then the system without the change in effect is counterfactual.

If scientists were to follow this philosophical prescription, then it is doubtful if any science at all could be done. Fortunately, practical scientists conduct comparative experiments by using control of the study environment to approximate the counterfactual ideal. Here, control means that the effects of any changes that occur to the system between repetitions of the experiment, other than the change under study, are individually negligible, although their combined effect may be noticeable. Statisticians treat such combinations of negligible effects as being due to chance variation4. When a change is introduced, checking whether it has an effect amounts to comparing the difference in response without and with the change to a corresponding measure of chance variation, that is, a test of statistical significance5.

3 Galileo has been referred to as "the father of modern science", largely because of his contributions to developing scientific method. Although not the originator of the experimental method, his use of comparative experiments was a significant contribution to these developments.

4 More formally, this amounts to assuming that such variation occurs in accordance with an appropriate probability model, typically, in the applications encountered here, the Normal probability model. Formal probability theory is needed to understand the mathematical basis for such theoretical models. Appropriate diagnostic analysis should be undertaken to check their validity in any given case. Here, an informal approach is adopted to chance models for statistical variation so that the often forbidding apparatus of formal probability is not required. Further development of this point of view may be found in Stuart, M., An Introduction to Statistical Analysis for Business and Industry, Arnold/Wiley, 2003; see §1.4.

5 An alternative approach to statistical significance testing in designed experiments may be used when random assignment of treatments to experimental units, introduced in the next two sections, is used. Such statistical tests are referred to as randomisation tests or, sometimes, exact tests. For some, this is the preferred approach. It has been shown that the exact tests and the Normal theory tests are almost the same when the Normal theory assumptions hold. Accordingly, the more familiar Normal theory approach will be followed here.


It should be noted that, in some areas of physics and chemistry, the element of chance variation referred to is so small as to be negligible6. Statistics has little to say about the design of such experiments. The focus here, therefore, will be on areas of application where the level of uncontrolled chance variation is comparable to the effects of changing the variables under study.

2. Design issues in experimentation; a manufacturing case study

This case study has been discussed in Base Module Chapter 4, §4.1. Here, the ideas involved are reviewed and elaborated.

Statistical assessment of a process change

Dr. Gerald J. Hahn of Corporate Research and Development at General Electric described a simple experiment to evaluate the effect of making a change in a process for the bulk manufacture of an electronic component7. The old and new processes were run on alternate days for a period of eight weeks, switching the sequence (old followed by new, or new followed by old) from week to week. Thus, during the first week, the new process was run on Monday, Wednesday and Friday with the old process being run on Tuesday, Thursday and Saturday while, during the second week, the old process was run on Monday, Wednesday and Friday with the new process being run on Tuesday, Thursday and Saturday, and this pattern continued in subsequent pairs of weeks. On each day, a sample of 50 components was checked and the number of defective components counted. The counts on successive pairs of days were recorded and tabulated as in Table 1.1.1 below.

On this evidence, it appears as if the new process is slightly better than the old. However, there is always the possibility that this apparent improvement is consistent with chance variation and that there is no real or long term improvement. An informal assessment of this may be made using a control chart. Figure 1.1.1 below shows a line plot of the differences in the last column of Table 1.1.1, with "control limits" representing a band of chance variation around a centre line at 0. These are "3-sigma" limits, based on Shewhart's idea that, in normal circumstances, we can expect some haphazard or chance variation in processes, but that we can put more or less well defined limits on such variation and that, if such limits are breached, then we must conclude that there is some assignable cause for the exceptional variation. Appendix 1.1.1 outlines the calculation of the control limits shown.

Figure 1.1.1 Differences in Numbers Defective with control limits

6 Ernest Rutherford, the first scientist to split the atom (1917) and regarded as the father of nuclear physics, famously said "If your experiment needs statistics, you ought to have done a better experiment".

7 Full details are given in Dr. Hahn's article, Statistical Assessment of a Process Change, in the Journal of Quality Technology, Volume 14, Number 1, pages 1-9, January 1982.



Table 1.1.1 Results of Comparison of Two Processes over Eight Weeks (24 Pairs of Days)

Numbers of Defectives in Samples of 50 Units

Week   Day pair   Old Process   New Process   Difference (New - Old)
 1         1           0             0                  0
 1         2           6             3                 -3
 1         3           3             3                  0
 2         4           1             4                 +3
 2         5           2             0                 -2
 2         6           0             0                  0
 3         7           1             0                 -1
 3         8           3             1                 -2
 3         9           0             2                 +2
 4        10           1             0                 -1
 4        11           0             2                 +2
 4        12           3             1                 -2
 5        13           0             0                  0
 5        14           0             2                 +2
 5        15           0             0                  0
 6        16           1             1                  0
 6        17           0             0                  0
 6        18           2             0                 -2
 7        19           2             0                 -2
 7        20           0             0                  0
 7        21           0             0                  0
 8        22           0             1                 +1
 8        23           0             2                 +2
 8        24           0             0                  0

Total                 25            22                 -3
Average             1.04          0.92              -0.13
Per Cent            2.08          1.83              -0.25

Figure 1.1.1 reveals no evidence of assignable causes of variation. Showing, as it does, a pattern of chance variation around the centre line at 0, it is entirely consistent with the hypothesis that the process change has no effect on process quality. A more formal test of this hypothesis supports this conclusion. The relevant test statistic is

$$Z = \frac{\bar{D} - 0}{SE(\bar{D})} = \frac{\bar{D}}{\sigma_D/\sqrt{n}} = \frac{-3/24}{1.57/\sqrt{24}} = -0.39.$$

Referred to a standard Normal frequency distribution, Z = -0.39 is not statistically significant. The conclusion to be drawn from this experiment and its analysis is that there is no statistically significant difference between the defect rate of the new process and that of the old.
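The test can be reproduced directly from the differences in Table 1.1.1; here is a minimal sketch in Python, using only the standard library. The standard deviation computed this way may differ slightly from the 1.57 quoted in the text, depending on the convention used to calculate it.

```python
import math

# Differences (New - Old) from Table 1.1.1
d = [0, -3, 0, 3, -2, 0, -1, -2, 2, -1, 2, -2,
     0, 2, 0, 0, 0, -2, -2, 0, 0, 1, 2, 0]

n = len(d)                                          # 24 day pairs
d_bar = sum(d) / n                                  # -0.125
sd = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
z = (d_bar - 0) / (sd / math.sqrt(n))
print(round(d_bar, 3), round(sd, 2), round(z, 2))   # Z close to -0.39
```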

Confounding factors

The protocol used to implement the experiment described above was quite complicated and resource-intensive, requiring, as it did, a process change every one or two days. A much simpler approach would be to monitor the old process for a four-week period, then introduce the new process and monitor it for a further four weeks, and make a simple comparison of defect rates at the end of the eight-week period. In fact, this is what the engineers who were considering the process change proposed to do. However, this approach suffers from a serious flaw. Conceivably, the process could be affected by some other factor that changes with time. For example, if the defect rate is sensitive to changes in ambient temperature, then the normal seasonal change in temperature, taken over a two-month period, could influence the results of the experiment, so that any perceived difference between the old and new processes could well be due to seasonal change rather than the process change.

Figure 1.1.2 shows that some such effect appears to have influenced the manufacturing process. The figure shows the numbers of defectives in time order, day by day, irrespective of which process, old or new, was used. Also shown is a computer-generated "smoother" designed to minimise the effects of chance variation, leaving an estimate of an assumed underlying trend. Here, the smoother suggests that the daily numbers of defectives were higher during the first four-week period (days 1 to 24) than during the second. They were considerably higher during the first week, appeared to remain reasonably steady or perhaps rise slowly for the next three weeks, and then seemed to reduce slowly over the second four-week period.

Figure 1.1.2 Numbers defective in time order

Table 1.1.2 below shows the difference in defect rates between the first four weeks and the second four weeks, for both processes combined and for each separately. The pattern is remarkably consistent, irrespective of process: the defect rate decreased by around 2 percentage points between the two periods. A formal test of the statistical significance of the observed differences between defect rates in the two periods may be based on the Z statistic for differences between percentages



$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\dfrac{\hat{P}_1(100-\hat{P}_1)}{n_1} + \dfrac{\hat{P}_2(100-\hat{P}_2)}{n_2}}}.$$

Table 1.1.2 Defect rates, per cent, with differences, for the first and second four-week periods

                   First Period   Second Period   Difference
Both Processes          3.0             0.9            2.1
Old Process             3.3             0.8            2.5
New Process             2.7             1.0            1.7

For both processes combined, the value of Z is calculated as

$$Z = \frac{3.0 - 0.9}{\sqrt{\dfrac{3.0 \times 97.0}{1200} + \dfrac{0.9 \times 99.1}{1200}}} = \frac{2.1}{0.56} = 3.75,$$

highly statistically significant8.
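A sketch of this calculation in Python; the helper function below is ours, written directly from the formula above.

```python
import math

def z_two_percentages(p1, n1, p2, n2):
    """Z statistic for the difference between two percentages."""
    se = math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)
    return (p1 - p2) / se

# Both processes combined: 24 days x 50 units = 1200 units per period.
print(round(z_two_percentages(3.0, 1200, 0.9, 1200), 2))
# ~3.73; the text rounds the standard error to 0.56, giving 3.75.
```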

Since the experiment actually carried out indicated that there was no statistically significant difference between the defect rates for the two processes, the "Old" and the "New", we can safely conclude that there must have been another factor influencing the outcome of the experiment. Such factors are referred to as "confounding" factors. Their effects can be confounded with, or confused with, the effects of the factors of primary interest and, if there is no awareness of their presence, invalid conclusions will be drawn concerning the effects of the factors of primary interest. In the case of the experiment under discussion, if the old process had been run in the first period and the new in the second, the experimenters would have been inclined to conclude that the new process was better.

Attaining homogeneous experimental conditions

The design of even a simple experiment such as the one described here may be critical to permitting valid conclusions to be drawn from the results of the experiment. A fundamental principle is that comparisons be made in the most homogeneous conditions possible. Clearly, conditions were not homogeneous across the two four-week periods, and the simple design that assigns the old process to the first period and the new to the second would have failed because of this. An improved design would be to alternate processes between successive weeks; successive pairs of weeks are likely to be more homogeneous than successive four-week periods. However, the evidence in Figure 1.1.2 indicates that the first week seemed quite different from the rest. The obvious next design improvement in this case would be to alternate the processes on a daily basis. As it happened, the production line was shut down nightly and restarted each day, so that daily changes were feasible and convenient.

8 Analysis of proportions is discussed in Base Module Chapter 3. Here, the discussion is presented in terms of percentages. Only slight modifications are needed. For example, the Z statistic above may be compared to the Z statistic shown on page 11, Base Module Chapter 3.


However, within the daily pairing arrangement, there is still room for improvement. If, as the evidence suggests, there is a trend in defect rate in some weeks, then putting one process first in each successive pair of days means that that process will appear better if the trend is downward, or worse if the trend is upward. To counteract this, the sequence of old and new process was alternated each week. Pairing of contiguous days, such as was done in this experiment, allows the comparison of Old and New to be made in the most homogeneous conditions possible. This makes it more likely that any substantial difference found may be ascribed to a difference between the processes and not to some other factor that could affect comparisons between days that were not close together in time. Pairing is an example of what is more generally referred to as local control. It is a special case of blocking, a reference to the contiguous plots of land used in agricultural experiments. This will be formally introduced and discussed in the next lecture.

Replication

A single comparison of Old and New on a pair of successive days will not allow any conclusion to be drawn about the effect of changing the process; any observed difference could just as well be due to the chance variation between the days. The answer to this conundrum is replication, whereby the comparison made on a pair of successive days is replicated on several other pairs of days, 24 in total in the case of this experiment. A bonus from using replication is that not only can the effect of the change be estimated, but the extent of the chance variation between days, which will be there whether or not there is a change effect, can also be estimated, thus setting up the conditions for a test of the statistical significance of the change effect.

An obvious question in this context is: how much replication is appropriate? Some guidance on this question may be found from statistical power analysis, as discussed in Base Module Chapter 4, §4.2. The basic idea is as follows. As noted, the paired comparison was replicated 24 times. Why 24? If a single comparison will not allow any conclusion to be drawn about the effect of changing the process, will 2 comparisons not suffice? Alternatively, is 24 a sufficient number of replications or should we be using a larger number, perhaps 50 or 100? A key factor in answering these questions is the formula for the standard error of the average difference,

$$SE(\bar{D}) = \sigma_D/\sqrt{n},$$

the denominator of the Z statistic used earlier to test the statistical significance of the observed average difference.

As n increases, the standard error of D̄ decreases and so the value of the Z statistic increases, making it more likely that the observed D̄ is judged statistically significant, thus making the Z test more powerful. A formula for determining the sample size needed to achieve a desired level of power for testing a given difference (between percentages in this case) is discussed and illustrated in Base Module Chapter 4, §4.3, p. 25. The power achieved with a given sample size for testing a given percentage difference may be calculated by inverting this formula, as shown on pp. 26-27.
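The Base Module formula is not reproduced here; as an indicative sketch, the standard Normal-approximation sample-size calculation for comparing two percentages can be coded as follows (the significance level, power and percentages are illustrative inputs, and the exact formula in the Base Module may differ in detail):

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Observations per group for a two-sided Z test of percentage p1
    versus p2, using the usual Normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var = p1 * (100 - p1) + p2 * (100 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

# e.g. sample size per group to detect a drop in defect rate from 3.0% to 0.9%
print(n_per_group(3.0, 0.9))
```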


Randomization

An alternative to the systematic alternation of weeks adopted in this case, with a view to avoiding time-related biases, is random assignment of processes to days within pairs. Then, if there are trends or other systematic patterns that affect the system, the chances that such systematic patterns affect the results of the experiment are very small; in fact, the chances are the same as the chances of finding a systematic pattern in the results of twenty-four successive tosses of a coin. The great advantage of random assignment is that not only does it minimise the chances of biases arising from systematic patterns that we might anticipate, but it also minimises the chances of biases arising from systematic patterns that we might never think of. Random assignment is regarded as the "gold standard" method of assigning treatments to experimental units (e.g., processes to days) where it is important to be able to identify cause and effect. This view has been especially strong in the area of clinical trials, although debate on this issue is emerging in that field. There has been long and extensive debate in the social sciences about the primacy of randomisation and the adequacy or otherwise of alternatives that have been proposed with a view to making causal inferences. These issues will be touched upon again later.
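Such an assignment can be generated mechanically. A minimal Python sketch of random assignment within the 24 day pairs, equivalent to 24 tosses of a coin:

```python
import random

random.seed(42)   # fixed seed so the illustration is reproducible

# For each day pair, toss a coin to decide which process runs first.
assignment = [random.choice([("Old", "New"), ("New", "Old")])
              for _ in range(24)]
print(assignment[:4])
```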

Illustration

Suppose that there was another systematic pattern that was not known to the engineers in charge of this process: an additional "other factor", unknown to the experimenters, that can be Up or Down, alternating every day, including Sunday. Assuming that the systematic assignment of Old and New processes to successive days outlined at the outset was in place, the combination of the known and unknown factors in the first two weeks would follow the pattern:

                     Week 1                       Week 2
             Experimental   Other         Experimental   Other
             Factor         Factor        Factor         Factor
Monday          Old           Up             New           Down
Tuesday         New           Down           Old           Up
Wednesday       Old           Up             New           Down
Thursday        New           Down           Old           Up
Friday          Old           Up             New           Down
Saturday        New           Down           Old           Up
Sunday          --            Up             --            Down

On inspection of the daily patterns of the two factors from Monday to Saturday, when the process is operational, it is seen that, whenever the Old process is run, the "other factor" is Up and whenever the New process is run, the "other factor" is Down. Thus, the two factors are irretrievably confounded and any observed change in process quality cannot be ascribed to the change in process solely. In the absence of knowledge or even suspicion of the presence of possible confounding factors, systematic assignment of processes to days is not reliable. On the other hand, random assignment is more likely to succeed by minimising the chances of variation patterns in an unknown factor coinciding with the variation patterns in the experimental factor. Ultimately, it may be as well to keep in mind this quote from MGM, §2.1, p.11:

"There is no foolproof method of overcoming the vagaries of allocating treatments to units"


3. How randomization works, a clinical trial case study

The effects of randomization may be further demonstrated using the following example of a clinical trial in which 596 patients suffering from heart disease were assigned at random to receive one of two treatments, Drugs or Surgery9. Of the 596, 310 were assigned to Drugs and 286 to Surgery. At the same time as the treatments were administered, ten factors thought to be relevant to the outcomes of the treatments were observed on each patient. (Such factors are referred to as covariates.) Table 1.1.3 shows the degree of balance of these factors between the two treatments. Thus, 94% of the drug-treated patients and 95% of the patients treated with surgery were limited in ordinary activity due to their heart condition10.

Table 1.1.3 Comparison of patients in a coronary treatment trial on ten covariates

Covariate                                       Drugs, per cent   Surgery, per cent
Limitation in ordinary activity                       94                 95
History of heart attack                               59                 64
Heart attack indicated by electrocardiogram           36                 41
Duration of chest pain >25 months                     50                 52
History of high blood pressure                        30                 28
History of congestive heart failure                    8.4                5.2
History of stroke                                      3.2                2.1
History of diabetes                                   13                 12
Enlarged heart                                        10                 12
High serum cholesterol                                32                 21

The first test of how well the randomization worked is to check whether patients were assigned in roughly 50:50 proportions to the two treatments11. This is equivalent to counting the numbers of heads and tails in 596 tosses of a coin and checking whether the percentage of each is close to 50%, that is, whether the coin is fair. In this case, the 310 patients assigned to Drugs constituted 52% while the 286 assigned to Surgery constituted 48%. A simple Z test may be used to check the statistical significance of these deviations from 50%:

$$Z = \frac{\hat{P} - 50}{\sqrt{\hat{P}(100-\hat{P})/n}} = \frac{52 - 50}{\sqrt{52 \times 48/596}} = \frac{2}{2.05} = 0.98,$$

not statistically significant by the usual standards. Thus, the random assignment has achieved reasonable balance between the two treatments in this case. It should be remembered, however, that in approximately 1 in 20 (5%) experiments of this kind where random assignment is used and this Z test is applied, the result of the test will be statistically significant.

9 Murphy, M. et al (1977), New England Journal of Medicine, 297, 621-627. See also Rosenbaum, P. (2002), Observational Studies, 2nd ed., Springer.

10 The degree of limitation was assessed in accordance with the New York Heart Association Functional Classification. Patients were recorded as limited if they were classified in NYHA classes II or III.

11 This ensures that subsequent statistical analyses involving comparing the two groups are most precise. However, achieving such balance when patients may arrive for treatment haphazardly over a period of time and when there may be several treatment centres (there were 13 in this case) is easier said than done.
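The balance check above is easily reproduced; a minimal Python sketch of the one-sample calculation:

```python
import math

n, drugs = 596, 310
p_hat = 100 * drugs / n                               # ~52 per cent
z = (p_hat - 50) / math.sqrt(p_hat * (100 - p_hat) / n)
print(round(z, 2))    # ~0.98, well inside the usual +/-1.96 bounds
```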


Of greater interest in this case is whether or not the randomization has achieved reasonable balance between the treatment groups with respect to the covariates. If not, it will not be possible to determine whether the difference in success rates between the two groups of patients was due to the different treatments or to the influence of one or more of the covariates. Intuitively, the Drugs patients constitute a random sample from the "population" of the 596 patients available for this trial, and so the proportion of individuals with a given characteristic in the sample should reflect the corresponding proportion in the "population". By the same token, the Surgery group also constitutes a random sample, so the same logic applies. Consequently, the proportions in each group should be approximately the same.

In this case, this intuition can be tested by inspecting the data in Table 1.1.3 above. Informal comparison of the Drugs and Surgery percentages for most of the covariates indicates reasonable balance, although the treatment groups appear not to be well balanced with respect to serum cholesterol level. These assessments of balance can be formally tested using a series of Z tests for comparing two percentages. The formula for the relevant Z statistic is the same as that already used to test the difference between defect rates in the first and second periods of the process change case study, shown earlier.

Exercise 1.1.1: Test the statistical significance of the difference between the percentages of high serum cholesterol patients in the Drugs and Surgery treatment groups.

Exercise 1.1.2: Identify the two next most imbalanced covariates in Table 1.1.3. Test the statistical significance of the degree of imbalance in each case. The smaller difference in percentages appears more significant (has a larger Z value). Explain.

The results of these formal tests support the conclusions of the informal analysis. On the one hand, both treatment groups were subject to similar effects (if any) due to most of the covariates, so that differences in success rates could not be attributed to them. On the other hand, the percentage of patients assigned to the Drugs treatment having a high serum cholesterol level was considerably higher than that for patients assigned to Surgery. It is conceivable, therefore, that the success rates in the two groups of patients could differ because of this, and it is not possible to distinguish such an effect from any difference in success rates attributable to the differing treatments. As in the case of the process change case study, when the effects of two possible causative factors are not distinguishable in this way, the factors are said to be confounded.

Randomize to balance the effects of unknown covariates

Thus far, randomization has been seen to achieve approximate balance in the assignment of patients to treatments as well as approximate balance in the distribution between the treatment groups of the values of (most of) a set of known covariates. The first was achieved deliberately; the assignment of patients to treatments was deliberately chosen to yield an approximately balanced result. However, the second was achieved with no reference to the values of the ten covariates; their balance came about because of the randomization of patients to treatments. It is entirely conceivable, indeed highly likely, that there are other factors unknown to the experimenters that may affect the outcome of the experiment if they are not at least approximately balanced. But, because we have seen how the randomization has achieved approximate balance with known covariates, we may hope and expect that the experiment will be approximately balanced with respect to such unknown covariates also. This makes randomization a very powerful (if not foolproof) control tool.


Randomisation to minimise bias

The randomization process has another advantage in experiments such as this. For example, if the consultant in charge of the patient's care were given the job of allocating treatments to patients, he or she might be tempted to use background knowledge of the patient's case to decide that one or other of the treatments would be better for that patient. That, if done consistently, would completely undermine the experiment. Random assignment, ideally by an independent agent, will reduce the possibility of such bias. On the other hand, not allowing the consultant to intervene may raise ethical issues for the consultant. A whole range of protocols has been developed to attempt to ensure that such sources of bias do not contaminate the experiment.

Random versus systematic assignment

In the process change case study, the actual assignment of days to processes, Old or New, was done according to a systematic rule, with the processes alternating every day within each week and the process used on the first day of the week alternating from week to week. Because the change had no real effect, it was possible to see that there was a confounding covariate reflected in a downward trend. The systematic assignment was able to deal with this. However, there is always the possibility that there is another covariate whose presence is not known. There is no guarantee that the systematic process chosen could deal with that unknown covariate. Conceivably, the effect of the unknown covariate could be confounded with an actual effect of the process change in such a way that each cancelled the other, thus concealing a real process change effect. Random assignment of days to process types would be more likely to achieve balance with respect to unknown covariates than systematic assignment. Randomization implies that the chances that an unknown covariate causes a problem are small. It must be noted, however, that randomization is not a perfect answer, as was seen with the imbalance in the serum cholesterol factor in the clinical trial example.

4. Multi-factor experiments

In the preceding sections, attention was confined to just one experimental factor. In many cases, there are several factors that may potentially affect a process. In this section, we address some of the issues that arise in multi-factor experiments. The first issue concerns whether factor effects should be studied one factor at a time or whether several factors should be studied simultaneously. The "one-factor-at-a-time" approach has been used traditionally in much scientific investigation. Here, we show that it is inferior to the multi-factor approach recommended by statisticians in at least two ways.

Traditional vs. statistical design

Consider a process that may be affected by two factors, say a chemical manufacturing process where the yield of the process may be affected by operating pressure and operating temperature. A choice is to be made between two possible temperature levels, say "Low" and "High", as well as between low and high levels of pressure. Suppose available resources allow the process to be run in experimental mode twelve times. It is easily demonstrated that the two-factor approach makes more efficient use of the data in determining the best level for each factor.


Efficiency

The traditional approach involves two steps:

keep one factor fixed at its standard level, say Temperature at "Low", run the process at each level of Pressure and choose the best level,

with Pressure set at its best level, run the process at the "High" level of Temperature and choose the best level of Temperature.

With twelve experimental runs, it is natural to run the process four times at each factor level combination, that is

Low Temperature, Low Pressure;
Low Temperature, High Pressure;
High Temperature, Best Pressure.

In the first step, the effect of changing Pressure is assessed by comparing the average of the four measurements of yield at Low Pressure with the average of the four measurements at High Pressure, with Temperature at its Low level in each case. In the second step, the effect of changing Temperature is assessed by comparing the better of those two averages of four measurements with the average of the four measurements at High Temperature. Now, consider the statistically recommended design, which looks at all combinations of levels of both factors in a single study. This may be illustrated as in Figure 1.1.3 below, where the subscripted Y's represent the twelve yield measurements made.

                          Temperature
                      Low              High

Pressure   High   Y7, Y8, Y9      Y10, Y11, Y12

           Low    Y1, Y2, Y3      Y4, Y5, Y6

Figure 1.1.3 Illustration of a full factorial design

In this design, there are six measurements made at each level of each factor:

Y1, Y2, Y3, Y4, Y5 and Y6 at Low Pressure, compared to Y7, Y8, Y9, Y10, Y11 and Y12 at High Pressure;


Y1, Y2, Y3, Y7, Y8 and Y9 at Low Temperature, compared to Y4, Y5, Y6, Y10, Y11 and Y12 at High Temperature.

It is seen that the effect of changing the levels of each factor is assessed by comparing an average of six measurements with an average of six. This represents a considerable improvement on the comparison of four with four employed in the traditional approach. To achieve the same quality of comparison with the traditional approach, eighteen measurements, divided into three subsets of six, would be required. Looking at it another way, with the two-factor design, all twelve measurements are used twice in assessing the factor effects whereas, with the one-at-a-time approach, four measurements are used twice while eight are used only once. Thus, the two-factor approach makes much more efficient use of the twelve measurements available.
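The gain can be quantified: assuming a common response variance per run, the standard error of a comparison of two means is proportional to the square root of the sum of the reciprocals of the group sizes. A short check in Python:

```python
import math

# SE of (mean of m runs) - (mean of m runs) is proportional to sqrt(2/m).
se_traditional = math.sqrt(1 / 4 + 1 / 4)   # averages of 4 versus 4
se_factorial = math.sqrt(1 / 6 + 1 / 6)     # averages of 6 versus 6

print(round(se_factorial / se_traditional, 3))   # ~0.816: a sharper comparison
```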

Interaction

There is a more subtle difference between the two approaches, which is demonstrated here with the aid of some hypothetical data. Suppose that, in a study following the traditional approach, the average responses at low and high pressure, with temperature low in both cases, were 65 and 60, respectively. On this basis, low pressure gives the higher process yield and so the next step is to keep pressure low and run the process at high temperature. Suppose that the average yield under these conditions is 70. Assuming that the standard operating conditions are low temperature and low pressure, the conclusion from this experiment is that an improvement can be achieved by running the process at high temperature while retaining pressure at its low level. This analysis is illustrated in Figure 1.1.4.

                          Temperature
                      Low              High

Pressure   High        60               --

           Low         65               70

Figure 1.1.4 Hypothetical results using the traditional approach

There is a potential flaw in this approach, however, arising from the fact that process performance has not been evaluated with both factors at their high levels. Conceivably, the yield in this case could be higher than at any other factor level combination, for example, as illustrated in Figure 1.1.5 below. If, using the one-at-a-time approach, Temperature rather than Pressure had been studied first, then the best combination of levels would have been found; at the first step, high temperature would have been chosen as best, and the second step, comparing low and high pressure at high temperature, would have led to the best combination. However, it cannot be regarded as satisfactory that locating the optimum conditions depends on having the good fortune to pick the right factor to study first. Here, the choice is between two factors.


With several factors, there are very many sequences of factors that might be chosen for one-at-a-time study and, typically, very few sequences will lead to the optimum conditions. An experimental strategy that has just a small chance of locating the optimal conditions can hardly be recommended. An explanation for the possible failure of the one-factor-at-a-time approach in this case may be found in the pattern of changes illustrated in Figure 1.1.5.

                          Temperature
                      Low              High

Pressure   High        60               75

           Low         65               70

Figure 1.1.5 Hypothetical results using the recommended approach

Note that process yield increases by 5, from 65 to 70, when Temperature changes from Low to High at the Low level of Pressure. However, at the High level of Pressure, the effect of changing from Low to High Temperature is 15, that is, from 60 to 75. Correspondingly, at the Low level of Temperature, yield decreases by 5, from 65 to 60, when Pressure is changed from Low to High whereas, at the High level of Temperature, yield increases by 5, from 70 to 75. In short, the effect of changing the level of one factor depends on the level of the other factor. In statistical terminology, this is referred to as an interaction between the factors.
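The arithmetic of these effects can be set out in a short Python sketch, using the hypothetical cell means of Figure 1.1.5:

```python
# Hypothetical cell means from Figure 1.1.5, keyed by (Pressure, Temperature).
yields = {("Low", "Low"): 65, ("Low", "High"): 70,
          ("High", "Low"): 60, ("High", "High"): 75}

# Effect of changing Temperature at each level of Pressure:
t_effect_low_p = yields[("Low", "High")] - yields[("Low", "Low")]     # +5
t_effect_high_p = yields[("High", "High")] - yields[("High", "Low")]  # +15

# A non-zero difference between the two effects signals an interaction.
print(t_effect_low_p, t_effect_high_p, t_effect_high_p - t_effect_low_p)
```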

Several levels

When each factor has just two levels, there are just four possible combinations. If there are more than two levels per factor, the number of combinations increases. For this reason, it is advisable to keep the number of levels to a minimum. For many purposes, two levels are adequate. If the relationship between the response variable and the factors is non-linear, however, three levels may be advisable. Consider the following example. Suppose an experimental change of temperature from 50 (Low) to 60 (High) resulted in a yield improvement from 65 to 69. This may be depicted as in Figure 1.1.6 below, in the style commonly seen in statistical software. Implicit in choosing the high level of temperature in this case is an assumption that the curve relating yield to temperature is linear, as depicted. Suppose, however, that the process had been run at a third Temperature level, say 55, intermediate between Low and High, with results as depicted in Figure 1.1.7. This clearly shows that the intermediate level is better. Conceivably, there may be other, better levels, possibly as depicted in Figure 1.1.8.

Figure 1.1.6 Effects plot for one factor experiment

Figure 1.1.7 Effects plot for one factor experiment with three factor levels

Figure 1.1.8 Effects plot for a one factor experiment and a possible response curve

With two factors, the response relationship may be depicted as a response surface. With more than two factors, graphical representation becomes virtually impossible. However, multi-factor designs with three or more levels may be used to assist in identifying optimal conditions.
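With three levels, an intermediate optimum can be estimated by fitting a quadratic through the three results. In the sketch below, the yield at temperature 55 is taken to be 70; that value is invented for illustration, since the text does not state the result depicted in Figure 1.1.7.

```python
import numpy as np

temp = np.array([50.0, 55.0, 60.0])   # the three temperature levels
yld = np.array([65.0, 70.0, 69.0])    # yields; the middle value is assumed

a, b, c = np.polyfit(temp, yld, 2)    # fit yield = a*T^2 + b*T + c
t_opt = -b / (2 * a)                  # vertex of the fitted parabola
print(round(t_opt, 1))                # ~56.7 for these illustrative data
```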



Several factors

Two-factor experiments are relatively simple. To allow for interaction, all that is needed is to ensure that all possible factor level combinations are included in the experiment. The principles of blocking and randomisation apply just as readily; all that is required is to ensure that each possible combination of factor levels occurs once within each homogeneous block, with random assignment of combinations to experimental units within a block. As the number of factors increases, the number of level combinations rapidly increases. With three factors, each with two levels, the number of level combinations is 2 × 2 × 2 = 8. If blocking and replication are required, the number of experimental runs required builds up very quickly. In such circumstances, not only do resources become a problem, but the task of controlling the experimental environment over the length of time necessary to complete the experiment becomes increasingly difficult. In addition, there are now three possible two-factor interactions and a possible three-factor interaction, whose presence will complicate any analysis carried out. With four two-level factors, the number of possible level combinations is 16; with 5, it is 32; with 6, 64; and so on. Effectively, so-called full factorial experiments, where the process is run at all possible level combinations, quickly become impossible. Nevertheless, when a process is subject to the possible influence of several factors, it is important to be able to distinguish the few (it is hoped) factors that have substantial effects. Fortunately, suitable designs have been devised for this purpose, sometimes referred to as screening designs. Carefully selected subsets, half, quarter, eighth, or less, of the full set of possible combinations may be chosen which give the necessary information when implemented experimentally. These designs are also referred to as fractional factorial designs. Their success depends on an assumption that there are no high order interactions or, in other words, that the response relationship is not too complicated. A small example of a half fraction is sketched below.
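As a small illustration of the idea, the following Python sketch enumerates the full 2³ design and extracts a standard half fraction using the defining relation I = ABC (one common choice among several):

```python
from itertools import product

# Full 2^3 factorial for factors A, B, C coded as -1/+1: 8 runs.
full = list(product([-1, 1], repeat=3))

# Half fraction defined by I = ABC: keep the runs with a*b*c = +1.
# In this fraction each main effect is aliased with a two-factor
# interaction (A with BC, B with AC, C with AB).
half = [(a, b, c) for (a, b, c) in full if a * b * c == 1]

print(len(full), "runs in the full design;", len(half), "in the half fraction")
for run in half:
    print(run)
```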

5. Experimental vs. Observational Studies

In preparation

6. Strategies for Experimentation

In preparation


Appendix 1.1.1 Calculation of the control limits for Fig 1.1.1

Shewhart's control limits for a variable plotted on a control chart are defined as

CL ± 3σ,

where CL is the Centre Line of the chart and σ is the standard deviation of the plotted variable. Interest is focused on whether or not there is a difference in defect rates between the old and new processes. This may be checked by seeing whether or not the observed differences deviate substantially from 0. Hence, it makes sense to locate the Centre Line at 0. A value for σ may be calculated from the observed differences shown in Table 1.1.1. Using a spreadsheet or standard statistical software, the calculated value for σ is found to be 1.57. Hence, the control limits are

0 ± 3 × 1.57, that is

±4.71.

As the plotted variable takes only integer values, applying these limits amounts to checking whether the observed differences are 4 or less in magnitude or 5 or more in magnitude. Visually, it makes sense to place the control limits halfway between these values, that is, at

±4.5.
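A sketch of the calculation, taking the value σ = 1.57 quoted above as given:

```python
sigma = 1.57                      # sd of the differences, as quoted above
centre = 0
lower, upper = centre - 3 * sigma, centre + 3 * sigma
print(round(lower, 2), round(upper, 2))   # -4.71 and 4.71

# The plotted differences are integers, so any limits between 4 and 5
# flag exactly the same points; the chart draws them halfway, at +/-4.5.
```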
