The Design and Analysis of Studies in Premature Infants Using Human Donor Milk or Preterm Formula as Primary Nutrition: A Critique of Schanler et al.

MARTIN L. LEE

Department of Biostatistics, UCLA School of Public Health and Prolacta Bioscience, Monrovia, California.

BREASTFEEDING MEDICINE, Volume 1, Number 2, 2006. © Mary Ann Liebert, Inc.

ABSTRACT

Nutritional studies of human milk in pre-term infants provide a unique challenge in clinical research. In this paper we review the general tenets of good clinical design and analysis and show how they might be properly applied in these situations. The recommendations are then compared with the approach used in a recent study by Schanler et al. It is concluded that future trials should consider these key statistical design and analytical issues.


INTRODUCTION

WITH THE INCREASED INTEREST in obtaining human donor milk (DM) for potential complete or partial nutrition in babies born premature and, typically, at very low birth weights (<1500 g), there has been a strong push for well-designed clinical trials comparing DM with mother’s own milk (MM) and preterm formula (PF). Certainly, whenever it is available, there is little question that MM is the best nutrition available for the newborn infant regardless of the birthweight; however, under conditions of prematurity this may not always be available. As a result, an obvious question for many neonatologists is whether DM (when available) is the best alternative or whether PF is as good under these circumstances. As noted, these questions can be answered properly only by clinical trials that employ proper statistical design, including sample size, and analytical techniques with the subsequently collected data. The purpose of this paper is to outline those principles so that published studies can be assessed properly on their merits and the conclusions drawn from these trials may be accepted. The recent article by Schanler et al.¹ brings these issues into focus, and it is the present author’s intent to contrast their approach to these studies with what is believed to be appropriate design and analytical techniques. This paper focuses on this particular study because it is one of the few trials that has attempted to compare MM, DM, and PF. Narayanan et al.² evaluated MM (with occasional supplementation by DM and nightly use of PF) versus PF alone in a randomized fashion. Lucas and Cole³ performed two parallel studies involving DM and PF in one instance (in which MM could be used in either group), and PF versus term formula in the other instance (again, in which MM could be used in either group). In none of these studies was there a specific attempt to compare all three preparations.

DESIGN

First and foremost with respect to the design and analysis of a controlled study, it is fundamental to identify the study arms that will be evaluated within the context of the trial. Clearly, in a study that seeks to use MM as a


control group (or “gold standard”), there is no question that participants who intend to provide MM to their infant would not be expected to be part of any study randomization. That leaves the randomized arms of the study to the DM and PF possibilities. This fact raises interesting and unusual design issues. Typically, randomized multigroup trials apply the randomization process to all potential study groups. As a result, the sample size is determined on the basis of the primary goal to compare all of the groups simultaneously (otherwise, why include all of the groups?), and the primary statistical analysis is precisely that comparison. Furthermore, this is the legitimate statistical approach, because the randomization among the groups allows for it. How is one to determine the proper approach under a semi-experimental design in which there is a study with multiple groups, not all of whom are randomized? It might be argued that regardless of the nature of the trial randomization (and, thus, design in this regard) the inclusion of three separate groups of patients (MM, DM, and PF) necessitates that the sample size determination and, in turn, the analysis be performed as if all groups had been randomized. The only issue, of course, is legitimizing any statistical comparison with the nonrandomized arm. In order to do this properly, it is necessary to statistically adjust for potential group differences using a multivariate approach. This particular problem is discussed later.

On the other hand, it is quite possible that the goal of a trial being discussed here is simply to compare primarily only two groups, possibly DM and PF. This is easily done, because the patients who will participate can be randomized to these arms and the methodology for the design and analytical issues is well characterized.⁴ However, one other primary issue must be addressed here that concerns whether the goal is to show that these two groups differ or are equivalent with respect to the desired primary endpoint. With respect to the former, most readers are familiar with the clinical trial that attempts to show one treatment is superior to another (usually a control), and the study design centers around the amount of improvement expected with the “new” treatment. (Statisticians usually refer to this improvement as delta, and the sample size for the study is calculated in order to have a high probability of demonstrating a statistically significant difference if delta is real [“power of the study,” or 1 − β], while minimizing the probability of finding statistical significance if delta is not real [significance level, or α].) However, the design becomes more difficult when the goal is to demonstrate equivalence. Equivalence in a statistical sense implies a clinical difference of less than some predefined amount, rather than a mathematically identical value. This is discussed in greater detail below. It is extremely important to recognize, though, that the lack of a significant difference between two groups is not statistical evidence of equivalence. This is easy to see if one designs a trial with only two patients per arm. Obviously, such a trial inevitably will show a lack of statistical significance, but it is doubtful that anyone would be convinced by this “lack of evidence” for a between-group difference.
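The two-patients-per-arm argument can be checked by brute force. The sketch below is an illustration only: it assumes a binary outcome and Fisher's exact test (the text does not name a test), and the function names are hypothetical. It enumerates every possible result of such a trial and reports the smallest p-value that could ever be obtained.

```python
from math import comb

def fisher_two_sided(a, n1, b, n2):
    """Two-sided Fisher exact p-value for a events out of n1 versus b out of n2."""
    k = a + b                                   # total events, fixed by conditioning
    total = comb(n1 + n2, k)

    def prob(x):
        # Hypergeometric probability of x events falling in arm 1
        return comb(n1, x) * comb(n2, k - x) / total

    observed = prob(a)
    lo, hi = max(0, k - n2), min(n1, k)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= observed * (1 + 1e-9))

def smallest_p(n_per_arm):
    """Smallest p-value over every possible outcome with n_per_arm subjects per arm."""
    return min(
        fisher_two_sided(a, n_per_arm, b, n_per_arm)
        for a in range(n_per_arm + 1)
        for b in range(n_per_arm + 1)
    )

min_p_two_per_arm = smallest_p(2)
print(f"smallest achievable p with 2 per arm: {min_p_two_per_arm:.3f}")
```

Even the most extreme result (both patients in one arm affected, neither in the other) gives p = 1/3, so nonsignificance is guaranteed before any data are collected and says nothing about equivalence.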

Within the context of the previous discussion, consider the design of Schanler et al. It was intended as a three-arm study, although patients, of course, were only randomized to the DM and PF groups. How was sample size handled in their situation? They state in their paper that the primary analysis for the study was to compare the DM and PF groups. Yet paradoxically the sample size calculations were based on the comparison between either the DM or PF groups (or both) with MM. As noted in the preceding, a fundamental premise of any comparative clinical trial is to determine the sample size for the study on the basis of the primary study analysis. In this case, presumably that would mean a demonstration that the DM and PF groups were statistically equivalent, because the paper explicitly states that these products are expected to be equal and, ultimately, pooled together. On the other hand, if the goal of the study is to compare the three nutritional products and, presumably, demonstrate that MM is superior to both DM and PF, then the sample size calculation must be based on this premise. Schanler et al. also espouse this goal. As a result, there are two objectives for the trial, which are at cross purposes from a design perspective. It is fundamental to the sample size calculation and analysis that the groups to be compared be treated equally in these regards. In other words, as noted, a basic approach to analyzing multigroup studies is to do a global analysis (i.e., evaluate all of the arms simultaneously) and then subanalyze any significant global finding using a multiple comparison approach that controls for the overall significance level of the testing. This procedure is basically espoused in any fundamental statistical textbook.⁵ Thus, with respect to sample size calculation, the first step is to determine the effect size for the three group differences. Schanler et al. suggested that for their primary outcome variable (incidence of late-onset sepsis [LOS] and/or necrotizing enterocolitis [NEC]) the MM group would have a rate of 30%, whereas the other two groups would have rates of 55% each. They perform their calculation with a two-group comparison between MM and either DM or PF (which is not in keeping with the spirit of the primary analysis of the study, the DM/PF comparison) using the mentioned rates and conclude that 70 subjects per group are needed, in spite of the fact that all three study arms were to be compared ultimately. (It is interesting to note that the rates found in their study were nowhere near these predictions. In the Schanler study, the MM group had a 29% rate, whereas the DM and PF groups had a 39% rate each. Given these figures [which yield a chi-square effect size of 0.098] and based on the proper analysis using the chi-square test for homogeneity, the study only had about 21% power to detect such a difference.)
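The effect size and power quoted in the parenthetical can be approximated with a short calculation. This is a sketch under stated assumptions, not the author's computation: it takes the effect size to be Cohen's w, assumes three equal groups of 70 (the planned size; the enrolled group sizes are not given in the text), and uses the noncentral chi-square distribution with 2 degrees of freedom. The function names are hypothetical.

```python
from math import exp, log, sqrt

def cohens_w(rates):
    """Cohen's w for a groups-by-outcome table, assuming equal group sizes."""
    p_bar = sum(rates) / len(rates)             # pooled event rate under the null
    between = sum((p - p_bar) ** 2 for p in rates) / len(rates)
    return sqrt(between / (p_bar * (1 - p_bar)))

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a central chi-square with even df (closed form)."""
    half = x / 2
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= half / (i + 1)
    return exp(-half) * total

def power_three_groups(w, n_total, alpha=0.05, terms=200):
    """Power of the 2-df chi-square test, as a Poisson mixture of central chi-squares."""
    crit = -2 * log(alpha)                      # exact critical value when df = 2
    ncp = n_total * w ** 2                      # noncentrality parameter
    weight = exp(-ncp / 2)                      # Poisson weight for j = 0
    power = 0.0
    for j in range(terms):
        power += weight * chi2_sf_even_df(crit, 2 + 2 * j)
        weight *= (ncp / 2) / (j + 1)
    return power

w = cohens_w([0.29, 0.39, 0.39])
power = power_three_groups(w, n_total=3 * 70)
print(f"w = {w:.3f}, power = {power:.2f}")
```

Under these assumptions w comes out at about 0.098 and the power in the low twenties as a percentage, in line with the roughly 21% quoted; the exact figure depends on the group sizes assumed.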

It is very informative to determine the sample size needed in this or any trial in which the aim is to statistically demonstrate the equivalence of two therapies. As noted, in order to perform this calculation, one must first define what is meant by “equivalence.” Clearly, this is not precisely meant to be the exact same frequency of the outcome for both groups, because that is unrealistic. Thus, some delta (in the previous terminology) is required. This quantity represents the largest difference between clinically indifferent treatment outcomes. For instance, in a comparison between DM and PF, one might be willing to clinically accept the equivalence of these if the primary outcome (say late-onset sepsis and/or NEC) rates were within ±10% of each other (and assuming that the true proportion for this outcome for both groups was on the order of 55%). With this definition and using 80% power with a two-tailed 5% significance level, then 424 patients per group would be required (calculation based on the PASS 2005 software program, Kaysville, UT). Even if the definition of equivalence were expanded to ±15%, the sample size needed would still be 189 per group. Of course, larger sample sizes would be needed for 90% power. For the 70 subjects per group used in the Schanler et al. paper, a definition of equivalence of an absolute difference of ±25% would have been required to have the requisite power to statistically demonstrate this. In terms of the primary outcome (late-onset sepsis and/or NEC), if a baseline rate of 55% is assumed, then equivalence would equate to rates of 30% to 80%, or as few as 21 cases and as many as 56, if the comparator group (say PF) had approximately 38 cases. There is little doubt that this definition goes far beyond a reasonable value for delta. (After all, the sample size calculation used by Schanler et al. is based on the goal of showing a 25% difference between MM and DM/PF.) Thus, the consequence of such results is to make it quite difficult to argue that DM and PF can be demonstrated legitimately as equivalent, and thus pooled.
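The sample sizes quoted above can be reproduced with a standard normal-approximation formula for an equivalence trial of two proportions. This is a sketch under assumptions: the method used inside PASS is not described in the text, and the formula below (two one-sided tests, each at the 5% level, with a true difference of zero) is simply one that happens to return the same figures. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf    # standard normal quantile function

def z_sum(alpha, power):
    # Each one-sided test is run at level alpha; the type II error
    # (1 - power) is split evenly between the two one-sided tests.
    return Z(1 - alpha) + Z(1 - (1 - power) / 2)

def n_per_group(p, margin, alpha=0.05, power=0.80):
    """Subjects per group needed to show equivalence within +/- margin at rate p."""
    return ceil(2 * p * (1 - p) * z_sum(alpha, power) ** 2 / margin ** 2)

def margin_for_n(p, n, alpha=0.05, power=0.80):
    """Smallest equivalence margin that n subjects per group can support."""
    return z_sum(alpha, power) * sqrt(2 * p * (1 - p) / n)

print(n_per_group(0.55, 0.10))              # margin of +/-10%
print(n_per_group(0.55, 0.15))              # margin of +/-15%
print(round(margin_for_n(0.55, 70), 3))     # margin supported by 70 per group
```

With a 55% event rate this gives 424 and 189 per group for the two margins, and a margin of about 0.246 (roughly ±25%) for 70 per group, matching the figures in the text.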

PRIMARY ANALYSIS

These comments concerning the sample size calculations in these types of studies lead to the issue of how a three-group comparison should be dealt with statistically. As noted and suggested routinely as the method for a primary analysis under these circumstances, all three groups should be compared simultaneously. With the rate or proportion data, this analysis is basically the familiar chi-square test (for homogeneity of rates) that is addressed in basic statistical textbooks. (A special adjustment for the p-value calculation should be used if there are particularly small rates, for example, a proportion of patients with more than one episode of LOS and/or NEC that is less than a few percent.) With continuous data (e.g., the growth parameters), the one-way analysis of variance model is appropriate. In Schanler et al. the approach used was to pool the DM and PF groups if the difference between them was not statistically significant and then compare this pooled group with MM. As recounted, the study was not designed and, in particular, powered for this comparison. As a result, a lack of statistical significance between these two groups was almost inevitable and reiterates the common statistical fallacy that a lack of significance implies equality, when in fact it merely suggests, and should be correctly stated as, a lack of statistical evidence for a difference, which is a totally different conclusion.
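The simultaneous three-group comparison of rates described here amounts to a chi-square test of homogeneity on a 3 × 2 table. A minimal sketch follows; the counts are hypothetical, chosen only to resemble rates of roughly 29%, 39%, and 39% in groups of 70, and are not the trial's data. The function names are also hypothetical.

```python
from math import exp

def chi_square_homogeneity(events, totals):
    """Pearson chi-square statistic for equal event rates across groups."""
    p_bar = sum(events) / sum(totals)           # pooled event rate under the null
    stat = 0.0
    for e, n in zip(events, totals):
        expected_events = n * p_bar
        expected_non_events = n * (1 - p_bar)
        stat += (e - expected_events) ** 2 / expected_events
        stat += ((n - e) - expected_non_events) ** 2 / expected_non_events
    return stat

def p_value_three_groups(stat):
    """Upper-tail p-value for 2 degrees of freedom (three groups, binary outcome)."""
    return exp(-stat / 2)                       # chi-square survival function, df = 2

# Hypothetical counts: events and subjects in the MM, DM, and PF groups
events = [20, 27, 27]
totals = [70, 70, 70]

stat = chi_square_homogeneity(events, totals)
p_value = p_value_three_groups(stat)
print(f"chi-square = {stat:.2f}, p = {p_value:.2f}")
```

The closed-form p-value holds only for 2 degrees of freedom; a table with more groups or outcome categories would need a general chi-square tail function, and very small expected counts would call for an exact method instead.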

With these ideas in mind, it is informative to see the consequences of a more reasonable reanalysis of the Schanler et al. data. For instance, looking at the three-group comparison of the LOS and/or NEC primary endpoint, one finds an overall p-value of 0.30. For just the one-episode subjects p = 0.61 is obtained. (For the patients with more than one episode, p = 0.03, although these data should more properly be analyzed using a statistical [Poisson] model, which evaluates the number of episodes per subject and not just whether a subject had multiple episodes. The latter approach has the effect of equating a patient with two episodes with one who had, say, five. This type of analysis was done in a study of intravenous immunoglobulin in this same patient population.)⁶ It is also interesting to note with these data that there is no statistical difference between DM and MM (p = 0.15); therefore, there is no “justification” for simply pooling only DM and PF. In fact, if MM and DM are pooled and compared with PF, p = 0.52!
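The loss of information from dichotomizing episode counts can be illustrated with a small example. The sketch below uses hypothetical episode counts and, in place of a full Poisson regression, a simpler conditional exact test for comparing two Poisson rates (given the total number of episodes, the count in one group is binomial). It assumes equal follow-up in both groups; all names are hypothetical.

```python
from math import comb

def multiple_episode_fraction(episodes):
    """Share of subjects with more than one episode (the dichotomized outcome)."""
    return sum(1 for e in episodes if e > 1) / len(episodes)

def conditional_rate_test(count_a, count_b, share_a=0.5):
    """Exact two-sided test of equal Poisson rates, conditioning on the total count.

    share_a is group A's share of the total follow-up (0.5 for equal exposure).
    """
    n = count_a + count_b

    def prob(k):
        return comb(n, k) * share_a ** k * (1 - share_a) ** (n - k)

    observed = prob(count_a)
    # Sum the probabilities of all splits no more likely than the observed one
    return sum(prob(k) for k in range(n + 1) if prob(k) <= observed * (1 + 1e-9))

# Hypothetical episodes per subject, 20 subjects per group
group_a = [0] * 14 + [1] * 3 + [2] * 3      # repeat cases have 2 episodes each
group_b = [0] * 14 + [1] * 3 + [5] * 3      # repeat cases have 5 episodes each

frac_a = multiple_episode_fraction(group_a)
frac_b = multiple_episode_fraction(group_b)
p_counts = conditional_rate_test(sum(group_a), sum(group_b))
print(frac_a, frac_b, sum(group_a), sum(group_b), round(p_counts, 3))
```

The dichotomized outcome is identical in the two groups (15% with multiple episodes), whereas the count-based comparison sees 9 versus 18 episodes. A regression model would additionally allow covariates and unequal follow-up, and would need to account for clustering of episodes within a subject.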

Similarly, with respect to LOS alone, the overall p-value is 0.13 with a p-value of 0.58 for just the single episode subjects and 0.14 (based on an exact calculation because of small cell sizes) for the multiple episode subjects. The same lack of significance is found for a comparison of the rates of blood isolates across all isolates, p = 0.12 (three-group comparison). There is also a conclusion in the paper concerning the lack of effect of DM with respect to hospital stay, yet the comparison of DM and MM in this regard shows no significant difference (p = 0.11).

Finally, with respect to the growth parameters, it is interesting to note, for instance, that for the length increment when the infant has achieved 150 mL/kg per day until the end of the study, the authors claim a significant difference between the pooled DM and PF groups versus MM (p = 0.03). Yet the DM and PF groups are significantly different from each other (p = 0.007), suggesting that both DM and MM are superior to PF and illustrating the dangers of pooling groups. Indeed, the only place in which a significant inter-group difference is found that concurs with the authors’ findings in this set of parameters was with the length increment for the entire study.

Essentially, from a reasonable analysis of the study data, one must conclude that there are virtually no differences among the three groups in total, leaving an overall negative result for the comparison of MM, DM, and PF. This may be attributed in part to the lack of sample size for the trial.

ANALYTICAL ISSUES ARISING FROM THE LACK OF COMPLETE RANDOMIZATION

Other methodological issues with these types of studies are worth pointing out. As noted, complete randomization is understandably impossible in studies that choose to have MM as the primary comparator. However, this tends to create a statistical problem that full randomization attempts to avoid, namely, the reasonable possibility that the study arms are not comparable with respect to important characteristics of the study subjects (i.e., covariates). First, one must a priori define these covariates, and such choices are based on clinical knowledge of which factors would be expected to correlate with the primary study outcome.⁷ Next, these covariates are incorporated into the statistical analysis (typically using regression models) as a way of “adjusting” for the group differences in the covariates in order to better conduct the primary comparison of the study groups with regard to the main clinical outcome. Ideally, this adjustment model should be used as the primary analysis, although very frequently it is not, and only the unadjusted comparison of the treatments is evaluated. It is certainly informative to evaluate model approaches and determine whether the addition of covariates to the statistical analysis does matter.
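Why adjustment matters when an arm is not randomized can be seen in a small stratified example. The sketch below is an illustration with hypothetical counts, and it swaps in the Mantel-Haenszel stratified odds ratio for the regression models mentioned above, because it adjusts for a single categorical covariate with very little code. The stratum labels and function names are assumptions.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table: a, b = events, non-events in group 1; c, d in group 2."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio pooled over strata of (a, b, c, d) counts."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical counts per covariate stratum:
# (group 1 events, group 1 non-events, group 2 events, group 2 non-events)
strata = [
    (30, 30, 10, 10),   # higher-risk stratum: 50% event rate in both groups
    (2, 18, 6, 54),     # lower-risk stratum: 10% event rate in both groups
]

# Collapsing over the covariate gives the unadjusted (crude) table
crude_table = [sum(column) for column in zip(*strata)]
crude_or = odds_ratio(*crude_table)
adjusted_or = mantel_haenszel_or(strata)
print(f"crude OR = {crude_or:.2f}, adjusted OR = {adjusted_or:.2f}")
```

In this constructed example the groups have identical event rates within each stratum, yet the crude odds ratio is about 2.67 because group 1 contains far more higher-risk subjects; the adjusted odds ratio is 1.0. A logistic regression with the covariate as a term serves the same purpose and extends to several covariates at once.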

In Schanler et al., the choice of covariates for their adjustment models (which were the secondary analyses of the data) was based on the finding of baseline significance for any variable observed. Although this is a common approach, there is a basic flaw to such a choice, particularly with regard to randomized study groups. In theory, randomization is supposed to eliminate any baseline differences among the groups. As a result, any statistical differences are essentially Type 1 errors. Therefore, if a variable is selected this way, it needs to have a sound clinical basis for selection, as noted. Taking this approach, it is curious that the investigators did not include any of the significant social characteristics (e.g., household income, education) as covariate adjusters. Could one not argue that such socioeconomic variables might have both positive and negative impact on the mother’s nutrition and, in turn, the quality of the milk? Finally, it is also considered appropriate practice to include the stratification variables (gestational age and receipt of prenatal steroids in Schanler et al.’s study) as part of the adjustment model. They used the latter, but only because it remained significant at baseline in spite of the stratification.
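The statement that baseline differences between randomized groups are essentially Type 1 errors can be checked by simulation. The sketch below is illustrative only: it assumes a single standard-normal baseline covariate, two equal arms, and a z-test with known variance, and the function name and parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def baseline_false_positive_rate(n_trials=2000, n_subjects=100, alpha=0.05, seed=0):
    """Fraction of simulated randomized trials with a 'significant' baseline difference."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half = n_subjects // 2
    std_error = sqrt(2 / half)                  # SE of a difference in means, unit variance
    hits = 0
    for _ in range(n_trials):
        # One baseline covariate per subject, unrelated to treatment by construction
        covariate = [rng.gauss(0, 1) for _ in range(n_subjects)]
        rng.shuffle(covariate)                  # random allocation: first half vs second half
        mean_a = sum(covariate[:half]) / half
        mean_b = sum(covariate[half:]) / half
        if abs(mean_a - mean_b) / std_error > z_crit:
            hits += 1
    return hits / n_trials

rate = baseline_false_positive_rate()
print(f"share of trials with a significant baseline difference: {rate:.3f}")
```

The share settles near the nominal 5% even though no real imbalance exists, which is why a significant baseline test in a randomized comparison is a weak reason, on its own, to select a covariate.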

THE INTENT-TO-TREAT PARADIGM

Last, another important concept in the analysis of clinical data is that of the intent-to-treat (ITT) paradigm. This has been defined in various ways in the clinical and statistical literature. Simply put, the ITT paradigm that has been universally adopted, particularly by the regulatory agencies, demands that all data collected be evaluated.⁸ The simple notion behind ITT is that the primary analysis should reflect clinical practice “warts and all.” Thus, an ITT analysis provides the most conservative comparison of the study data, making statistical significance with this approach that much more meaningful. The definition of ITT used in the Schanler et al. study is worth noting. Cases of LOS and/or NEC that occurred before a milk intake of 50 mL/kg was achieved were dropped. Although the authors’ argument that they wanted these outcomes to relate to milk exposure is understandable, this is not in keeping with the ITT concept.
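The difference between an ITT count and an exposure-restricted count can be made concrete with a toy data set. Everything in the sketch below is hypothetical: the records, the field names, and the reading of "dropped" as "the event is not counted while the infant stays in the denominator." It is meant only to show how the two rules can diverge.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Infant:
    arm: str
    had_event: bool
    intake_at_event: Optional[float] = None   # mL/kg reached when the event occurred

def event_rate(infants, arm, min_intake=None):
    """Event rate in one arm; if min_intake is set, earlier events are not counted."""
    group = [i for i in infants if i.arm == arm]
    counted = 0
    for infant in group:
        if not infant.had_event:
            continue
        if min_intake is not None and infant.intake_at_event < min_intake:
            continue                            # exposure-restricted rule drops this event
        counted += 1
    return counted / len(group)

# Hypothetical records: 10 infants per arm, 4 events per arm
infants = (
    [Infant("DM", True, 20.0), Infant("DM", True, 35.0),
     Infant("DM", True, 80.0), Infant("DM", True, 120.0)]
    + [Infant("DM", False) for _ in range(6)]
    + [Infant("PF", True, 60.0), Infant("PF", True, 90.0),
       Infant("PF", True, 110.0), Infant("PF", True, 140.0)]
    + [Infant("PF", False) for _ in range(6)]
)

itt_rates = {arm: event_rate(infants, arm) for arm in ("DM", "PF")}
restricted_rates = {arm: event_rate(infants, arm, min_intake=50.0) for arm in ("DM", "PF")}
print(itt_rates, restricted_rates)
```

Under ITT both arms show a 40% event rate, while the restricted rule halves the rate in one arm purely because its events happened earlier, which is how such an exclusion can shift a between-group comparison.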

CONCLUSION

In this brief exposition of the design and analytical issues associated with studies of human milk and preterm formula, a number of methodological and analytical issues have been noted. In particular, the difficulties in using a standard, completely randomized approach create unusual statistical problems that must be handled in a thoughtful manner. This paper has illustrated these issues through the evaluation of the study recently reported by Schanler et al. Although this was an earnest attempt to provide clinical answers to nutritional questions in premature infants, the statistical problems with this study bring its conclusions into question. As noted, it appears that this trial was not able to demonstrate any important clinical differences among MM, DM, and PF, particularly with regard to their primary outcome of infection-related events. Although theoretically and practically MM is the nutrition of first choice for premature infants, this study could not adequately show that. Furthermore, there is no real evidence that DM is substantially worse or that it is necessarily equivalent to PF. With regard to this latter point, a trial directly comparing DM and PF needs to be undertaken with a reasonable sample size. (A study merely looking at NEC rates would require at least 900 subjects to be adequately powered.) Furthermore, from a practical perspective, the formulation of DM must be done under carefully controlled conditions, particularly requiring consistency in the caloric content and constituents of the milk.

REFERENCES

1. Schanler RJ, Lau C, Hurst NM, O’Brian Smith E. Randomized trial of donor human milk versus preterm formula as substitutes for mother’s own milk in the feeding of extremely premature infants. Pediatrics 2005;116:400–406.

2. Narayanan I, Prakash K, Bala S, et al. Partial supplementation with expressed breast-milk for prevention of infection in low-birth-weight infants. Lancet 1980;ii:561–563.

3. Lucas A, Cole TJ. Breast milk and neonatal necrotizing enterocolitis. Lancet 1990;336:1519–1523.

4. Rosner B. Fundamentals of Biostatistics, 6th ed., Chapter 8. Thomson Higher Education, Belmont, CA, 2006.

5. Rosner B. Fundamentals of Biostatistics, 6th ed., Chapter 12. Thomson Higher Education, Belmont, CA, 2006:573.

6. Baker CJ, Melish MB, Hall RT, et al. Intravenous immune globulin for the prevention of nosocomial infection in low birth weight neonates. NEJM 1992;327:213–219.

7. Piantadosi S. Clinical Trials: A Methodological Perspective. John Wiley & Sons, New York, 1997.

8. Chow S-C, Liu J-P. Design and Analysis of Clinical Trials: Concepts and Methodologies, 2nd ed. John Wiley & Sons, Hoboken, NJ, 2004.

Address reprint requests to:
Martin L. Lee, Ph.D., C.Stat.
Department of Biostatistics
UCLA School of Public Health and Prolacta Bioscience
605 E. Huntington Drive
Monrovia, CA 91016

E-mail: [email protected]

This article has been cited by:

1. Christina J. Valentine, Georgia Morrow, Soledad Fernandez, Parul Gulati, Dennis Bartholomew, Don Long, Stephen E. Welty, Ardythe L. Morrow, Lynette K. Rogers. 2010. Docosahexaenoic Acid and Amino Acid Contents in Pasteurized Donor Milk are Low for Preterm Infants. The Journal of Pediatrics 157:6, 906-910.