
    Notes on Some Aspects of Regression Analysis

Author(s): D. R. Cox
Source: Journal of the Royal Statistical Society. Series A (General), Vol. 131, No. 3 (1968), pp. 265-279
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2343523


By D. R. Cox
Imperial College

[Read before the ROYAL STATISTICAL SOCIETY on Wednesday, March 20th, 1968, the President, Dr F. YATES, C.B.E., F.R.S., in the Chair]

SUMMARY
Miscellaneous comments are made on regression analysis under four broad headings: regression of a dependent variable on a single regressor variable; regression on many regressor variables; analysis of bivariate and multivariate populations; models with components of variation.

1. INTRODUCTION
This is an expository paper consisting not of new results but of miscellaneous and isolated comments on the theory of regression. The subject is a very broad one and the paper is in no sense comprehensive. In particular, the ideas of regression are the basis of much work in time series analysis and in multivariate analysis and these specialized subjects are barely mentioned; nor are experimental design and sampling theory problems associated with regression considered. Another serious limitation to the paper is the omission of relevant parts of the econometric literature.

Two general situations are distinguished and in their simplest forms are:
(i) a dependent variable Y has a distribution depending on a regressor variable x and it is required to assess this dependence;
(ii) there is a bivariate population of pairs (X, Y) and the joint distribution is to be analysed.
Capital letters are used for observations represented by random variables and lower-case letters for other observations.

Most theoretical discussion of regression starts from a quite tightly specified model in which some observations are regarded as corresponding to random variables with probability distributions depending in a given way on unknown parameters. Many of the difficulties of regression analysis, however, concern such questions as which observations should be treated as random variables, what are suitable families of models, and what is the practical interpretation of the conclusions.

The special computational and other problems associated with fitting non-linear models will not be considered explicitly, although much of the discussion applies as much to such problems as to the simpler linear models with which the paper is overtly concerned. Particular applications will not be discussed in detail but one or two typical but hypothetical situations will be outlined later for illustration. Throughout the discussion the measurement of uncertainty by significance tests and confidence limits is important but not paramount.

Inevitably the paper tends to emphasize difficulties likely to be encountered; of course awareness of potential difficulties is a good thing, but only as one facet of constructive scepticism.

The methodology of regression is described by Williams (1959) and by Draper and Smith (1966) and the more theoretical aspects by Plackett (1960), Kendall and Stuart (1967, Chapters 26-29) and Rao (1965).

2. REGRESSION ON A SINGLE REGRESSOR VARIABLE
Suppose that there are n pairs of observations $(x_1, Y_1), \ldots, (x_n, Y_n)$ and that for each value of the regressor variable there is a population of values of the dependent variable from which the observed $Y_i$'s are randomly chosen. This is often taken for granted as the starting point for theoretical discussion of regression; in (i) and (ii) which follow, some of its implications are discussed and then in (iii)-(vi) comments on some more advanced matters are made.

(i) Choice of dependent and regressor variables. In some situations the x values may be chosen deliberately by the experimenter and Y is a response dependent on x. The more difficult situation is when both types of observation can be regarded as random. Then we take as dependent variable:
(a) the "effect", the regressor variable being the explanatory variable; for a given value of the explanatory variable, we ask what is the distribution of possible responses;
(b) the variable to be predicted, the regressor variable being the variable on which the prediction is to be based. A full solution of the prediction problem is to give the conditional distribution of the variable to be predicted given all available information on the individual concerned.

Suppose that it is reasonable to consider an existing or hypothetical population of Y values for each x. Now whether or not the x's can be regarded as random may well affect the interpretation and application of the conclusions. For analysis of the regression coefficient in the model, however, we argue conditionally on the x values actually observed (Fisher, 1956, p. 156), provided that the x values by themselves would have given no information about the parameter of interest.

Example 1. A random sample of fibre segments of fixed length is taken from a homogeneous source and for each segment the mean diameter is measured accurately and the breaking load determined. Both observations can be regarded as random variables but in view of (a) it is reasonable to take breaking load, or rather its log, as dependent variable and log diameter as regressor variable. The same model would be appropriate if, for example, a fixed number of fibres is selected randomly from each of a number of diameter groups. Whether the regression is a fruitful thing to consider depends on its stability, i.e. on whether there is a reproducible relationship involved; see (iii).

Example 2. A more difficult case is illustrated by a calibration experiment in which for n individuals (a) a "slow" measurement and (b) a "quick" measurement are made of some property. Often (a) is a definitive determination, for example an optical measurement of fibre diameter, and (b) is the result of some indirect and much easier method. In future, only the "quick" measurement will be obtained and it is required to predict from this the corresponding "slow" measurement. If the n individuals initially observed are a random sample from the same population as the individuals for which future predictions are to be made, we take the "slow" measurement to be the dependent variable, since this is the one to be predicted. Suppose, however, that the n individuals were chosen systematically, for example to have "slow" values approximately evenly distributed over the range of interest.

Usually it would be reasonable to regard the "quick" value as having a random component and, provided that a physically stable random system is involved and the relationship linear, to write

"quick" = $\alpha$ + $\beta$ "slow" + "error",

where the "error" does not depend on the "slow" value. Given a new "quick" measurement, we have to estimate a non-random variable by inverse estimation (Williams, 1959, pp. 91, 95). If both variables can in fact be regarded as random, the second approach is inefficient because it ignores the information about the marginal distribution of the "slow" measurement.

(ii) The omitted variables. Suppose that z is a regressor variable that might have been included in the regression analysis but in fact is not, for example because no observations of it are available. What assumptions about z are made when we consider the regression of Y on x alone? Box (1966) has given an illuminating discussion of the dangers of omitting a relevant variable. The relationship ignoring z will be meaningful if:
(a) changes in z have no effect on Y;
(b) in a randomized experiment in which x corresponds to a treatment, there may be unit-treatment additivity. Then the usual analysis will give an estimate of the effect of changing x, and an estimate of the standard error. The estimate refers to the difference between the response on one unit with certain values for x and for the omitted variable z and what would have been observed on that same unit with a different value of x and the same z;
or (c) z is a random variable, say Z, and the distributions of Z, given x, and of Y, given Z = z and x, are well defined. If x is a random variable X, this amounts to the requirement that (X, Y, Z) have a well-defined three-dimensional distribution. Then the regression of Y on x is well defined and includes a contribution associated with changes in z.

Example 3. Consider observational data for individuals with an ultimately fatal disease, Y being the log time to death and x some aspect of the treatment applied, called the dose, and that the regression of Y on x is analysed. The missing variable z is the initial severity of the disease.
If, as might well be the case, z largely determines x, the regression of Y on x, although well defined under the circumstances of (c), would be of very limited usefulness. In particular it would not give for a particular individual an estimate of the effect on his Y of changing dose. This is an extreme example of a difficulty that applies to many regression studies based on observational data.

(iii) Stability of regression. While a fitted regression equation may often be useful simply as a concise summary of data, it is obviously desirable that the relation should be stable and reproducible. This is stressed by Ehrenberg (1968) and Nelder (1968); see also Tukey (1954). Stability might mean that when the experiment is repeated under different conditions:
(a) the same regression equation holds, even though other aspects of the data change;
or (b) parallel regression equations are obtained;
or (c) satisfactory regression lines are always obtained but with different positions and slopes.
In cases (b) and (c) the fitting of regression lines will be an important first step in the analysis. The second step will be to try to account for the variation in the parameters whose estimates do vary appreciably, possibly by a further regression analysis on regressor variables characterizing the different groups of observations and taking the initial regression coefficients as dependent variables. In testing the significance of the differences between the regression coefficients in different groups it will often be important to allow for correlations between the groups of data (Yates, 1939).

(iv) Choice of relation to be fitted. This choice will depend on preliminary plotting and inspection of the data and possibly on the outcome of earlier unsuccessful analyses. In addition, the model may take account of:
(a) conclusions from previous sets of data;
(b) theoretical analysis, including dimensional analysis, of the system;
(c) limiting behaviour.
Further, any given model can be parametrized in various ways and, in choosing a parametrization, the following considerations may be relevant:
(a)' individual parameters should have a physical interpretation, say in terms of components in a theoretical model or in terms of combinations of regressor variables of physical meaning;
(b)' individual parameters and estimates should have a descriptive interpretation, for example in terms of the average slope and curvature of response over the range considered;
(c)' interpretations, such as those of (a)' and (b)', should be insensitive to secondary departures from the model;
(d)' any instability between groups should be confined to as few parameters as possible;
(e)' the sampling errors of estimates of different parameters should not be highly correlated.
These requirements may be to some extent mutually conflicting.

There is not space here to discuss and exemplify all these points. As just one example, preliminary analysis of data of Example 1 might, if the x's have relatively little variation, suggest that linear regressions of breaking load on diameter and of log breaking load on log diameter would fit about equally well. The second would in general be preferable because, with respect to the above conditions:
(b) it permits easier comparison with the theoretical model breaking load $\propto$ (diameter)$^2$;
(c) it ensures that breaking load vanishes with diameter;
(a, b)' the regression coefficient, being a dimensionless power, is easier to think about than a coefficient having the dimensions of load/diameter.

(v) Goodness of fit. This can be examined in a number of ways:
(a) by a non-probabilistic graphical or tabular analysis, for example of residuals;
(b) by significance test, using as a test statistic some aspect of the data thought to be a reasonable measure of departure from the model. Thus, the standardized third moment of the residuals could be used, if possible skewness is of interest;
(c) by the fitting of an extended model reducing to the given model for particular parameter values. The most familiar example is the inclusion of the new regressor variable, possibly a power of the first variable, or a product of variables, when multiple regression is being considered;
(d) by the fitting of a quite different model, seeing whether it fits better than the initial one.
Such examination of the adequacy of the model is important if models are to be refined and improved. Often, but not always, the primary aspect of the model will be the form of the regression equation and the adequacy of what may be secondary assumptions about constancy of variance, normality of distribution, etc. will be of rather less importance. Formal significance tests are valuable but, of course, need correct interpretation. A very significant lack of fit means that there is decisive evidence of systematic departures from the model; nevertheless the model may account for enough of the variation to be very valuable (Nelder, 1968). A non-significant test result means that in the respect tested the model is reasonably consistent with the data; nevertheless there may be other reasons for regarding the model as inadequate.

Of (a)-(d) the least standard is (d). It is closely related to the problem of choosing between alternative regression equations (Williams, 1959, Chapter ), for example between the regression of Y on $x_1$ alone and that of Y on $x_2$ alone. The more usual procedure in such a case will, however, be to fit both variables to cover the possibility that the joint regression is appreciably better than either separate one.
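As an illustration of check (b) under goodness of fit, the following is a minimal sketch of the standardized third moment of the residuals from a straight-line fit. The helper names `fit_line` and `residual_skewness` and the data are hypothetical, chosen only for illustration; the sketch returns the descriptive statistic and makes no attempt to supply the reference distribution that a formal significance test would need.

```python
def fit_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_skewness(x, y):
    """Standardized third moment of the least squares residuals."""
    a, b = fit_line(x, y)
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    n = len(res)
    # With an intercept in the model the residuals sum to zero,
    # so the central moments are simple averages of powers.
    m2 = sum(r ** 2 for r in res) / n
    m3 = sum(r ** 3 for r in res) / n
    return m3 / m2 ** 1.5


# Hypothetical data: a curved response wrongly fitted by a straight line.
x = [float(i) for i in range(1, 11)]
y = [xi ** 2 for xi in x]
a, b = fit_line(x, y)
g1 = residual_skewness(x, y)
print(a, b, g1)
```

With these made-up data the residuals are markedly skew, which is the kind of systematic departure the statistic is meant to flag; a plot of the residuals against x, as in (a), would show the curvature more directly.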

As a rather different example, suppose that normal theory linear regressions of ($\alpha$) Y on x, ($\beta$) Y on log x, ($\gamma$) log Y on log x are considered. The goodness of fit of ($\alpha$) and ($\beta$) can be compared descriptively by the residual sums of squares, but to compare say ($\alpha$) with ($\gamma$) the residual sum of squares cannot be used directly. The most usual procedure is probably then to compare squared correlation coefficients, but for comparison of the full models it is probably better to compare the maximized log likelihoods of $Y_1, \ldots, Y_n$ under the two models. Cox (1961, 1962) has discussed the construction of significance tests in such situations.

An alternative and in many ways preferable approach is to consider a comprehensive model containing ($\alpha$), ($\beta$), ($\gamma$) as special cases. For example the normal theory linear regression of

$$\frac{Y^{\lambda_2} - 1}{\lambda_2} \quad \text{on} \quad \frac{x^{\lambda_1} - 1}{\lambda_1}$$

could be taken and all parameters, including $(\lambda_1, \lambda_2)$, estimated and tested by maximum likelihood (Box and Tidwell, 1962; Box and Cox, 1964). This is computationally formidable if there are several regressor variables.

(vi) More complex dependence on the regressor variable. In most regression analyses it is assumed that the dependence on the regressor variable is confined to changes in the conditional mean of Y. Transformation of Y may be necessary to achieve this; if different transformations are required to linearize the regression of the mean and to stabilize variance the first will usually have preference, simply because primary interest will usually lie in the mean. To study, for example, changes in the conditional variance of Y we can:
(a) plot residuals;
(b) group on x, calculate variances within groups, if necessary applying an adjustment for changes of mean within groups, and then consider the regression on x of log variance;
(c) fit, for example by maximum likelihood, a model in which parameters are added to account for changes in variance. For example, the variance might be taken to be $\sigma^2 \exp\{\gamma(x - \bar{x})\}$. It would nearly always be right to precede any such fitting by (a) or (b) in order to get some idea of an appropriate model and of whether the more complex fitting is likely to be fruitful.
Similar remarks apply to the study of changes of distributional shape.

If the regression in the mean is linear but there are substantial changes in variance a weighted analysis will often be required, although the changes in variance have to be quite substantial before there is appreciable gain in precision in estimating the regression coefficient. Of course the changes in variance may be of intrinsic interest, or need separate study in order to specify how the precision of prediction depends on x.
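Procedure (b) of point (vi), grouping on x and regressing log variance on x, can be sketched as follows. This is only a rough illustration under stated assumptions: the data are simulated with variance proportional to exp(x), so the slope of log variance on x should come out near 1; the function names, the number of groups and the seed are all hypothetical choices, and the adjustment for changes of mean within groups is done here by fitting a separate line inside each group.

```python
import math
import random


def fit_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def log_variance_regression(x, y, n_groups):
    """Regress log within-group variance of y on the group mean of x.

    Observations are sorted on x and cut into n_groups equal-sized groups;
    any observations left over after the last full group are ignored.
    """
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_groups
    centres, log_vars = [], []
    for g in range(n_groups):
        chunk = pairs[g * size:(g + 1) * size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        # Adjust for the change of mean within the group by a local line.
        a, b = fit_line(xs, ys)
        rss = sum((yi - a - b * xi) ** 2 for xi, yi in chunk)
        centres.append(sum(xs) / len(xs))
        log_vars.append(math.log(rss / (len(chunk) - 2)))
    return fit_line(centres, log_vars)


# Simulated data (an assumption for illustration): sd = exp(x / 2),
# so that log variance increases linearly in x with slope 1.
random.seed(1968)
x = [random.uniform(0.0, 4.0) for _ in range(2000)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, math.exp(0.5 * xi)) for xi in x]
intercept, gamma_hat = log_variance_regression(x, y, n_groups=10)
print(intercept, gamma_hat)
```

The fitted slope plays the role of $\gamma$ in the variance model of (c) and could serve as a starting value if a maximum likelihood fit of that model were later attempted.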

3. REGRESSION ON SEVERAL REGRESSOR VARIABLES
Suppose now that for each individual several regressor variables are available, i.e. that for the ith individual we observe $(Y_i, x_{i1}, \ldots, x_{ip})$. We consider mainly the case where $x_1, \ldots, x_p$ are physically distinct measurements rather than, for example, powers of a single x. Virtually all the discussion of Section 2 is relevant, but there are new points mostly connected with the choice of the regressor variables and the interpretation of situations in which there is appreciable non-orthogonality among the regressor variables.

There are two extreme situations to consider. In the first the number of regressor variables is quite small, say not more than three or four. It is then perfectly feasible both to fit the $2^p$ possible regression equations and to examine them individually. Those regressor variables the nature of whose effect is clearly established can be isolated and ambiguities arising from non-orthogonality of other variables listed and, as far as possible, interpreted. Also further regression equations involving, say, squares and cross-products of some of the original regressor variables can be fitted, if required. In the second case the number p of regressor variables is larger. It may still be computationally feasible to fit all $2^p$ equations, but unless all pairs of regressor variables are nearly orthogonal, the interpretation is likely to be difficult and, at the least, some further techniques are required for handling the information from the fits. In many applications of this type there is a reasonable hope that only a fairly small number of regressor variables have important effects over the region studied. The broad distinction between these two cases should be borne in mind in the following discussion.

(i) Interpretation and objectives. Lindley (1968) has emphasized that the choice between alternative equations depends on the purpose of the analysis and has discussed two cases in detail from a decision-theoretic viewpoint, one a prediction problem and one a control problem. His results show very explicitly the consequences of strong assumptions about the problem and are likely to be useful guidance in other cases too. The following remarks refer to cases where less explicit assumptions are possible about the nature of the problem and the objectives of the analysis.

Suppose first that the objective is to predict Y for future individuals in the region of x-space covered by the data. In particular, the x's may be random variables and the new individuals be drawn from the same population. Then any regression equation that fits the data adequately will be about equally effective on the average over a series of x-values. If, however, it is thought that not all regressor variables contribute, there is likely to be a gain from excluding regressor variables with an insignificant effect. Note that a Bayesian analysis of this situation suggests reducing the contribution of, rather than eliminating, such variables and this is sensible also from a sampling theory viewpoint. So long as p is small compared with n, it is not likely to make a major difference which of these various possibilities is taken. The algorithm of Beale et al. (1967) for selecting the "best" equation with a specified number of regressor variables and the various automatic stepwise procedures described by Draper and Smith (1966, Chapter ) will be relevant.

Suppose next that the prediction is to be made for an individual in a new region of x-space. Things are now different. For example, suppose that $x_1$ and $x_2$ are almost linearly related in the initial sample of observations and that the partial regression coefficients are insignificant, the combined regression being very highly significant. It is thus known that at least one of $x_1$ and $x_2$ has an important contribution, but there will be many regression equations fitting the data about equally well. Under the circumstances of the previous paragraph this is immaterial, but if prediction of Y is attempted for $(x_1, x_2)$ far from the original linear relation, extremely different results will be obtained from the different fits. In such cases the possibilities are:
(a) to postpone setting up a prediction equation until better data are available for estimation;
(b) to use external information to decide which is "really" the appropriate equation;
(c) to use the formal variance of prediction from the full equation as a means of detecting individuals for which prediction from any regression equation is hazardous.
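The hazard of predicting far from the original linear relation between $x_1$ and $x_2$ can be seen in a small constructed case. The sketch below uses hypothetical, noise-free data in which $x_2$ is almost equal to $x_1$ and Y is generated as $x_1 + x_2$; the names `fit_line` and `r_squared` are illustrative helpers, not part of any method cited in the text. Two single-variable equations fit the sample almost perfectly, yet disagree sharply at a point off the line.

```python
def fit_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(x, y):
    """Proportion of the variation of y accounted for by the line on x."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1.0 - rss / tss


# Hypothetical sample: x2 differs from x1 only by a small alternating term.
x1 = [float(i) for i in range(1, 21)]
x2 = [xi + 0.1 * (-1) ** i for i, xi in enumerate(x1)]
y = [u + v for u, v in zip(x1, x2)]

a1, b1 = fit_line(x1, y)   # equation using x1 alone
a2, b2 = fit_line(x2, y)   # equation using x2 alone

# A new individual far from the relation x2 ~ x1 seen in the sample.
new_x1, new_x2 = 10.0, 0.0
pred_from_x1 = a1 + b1 * new_x1
pred_from_x2 = a2 + b2 * new_x2
print(r_squared(x1, y), r_squared(x2, y), pred_from_x1, pred_from_x2)
```

Within the sample the two equations are practically indistinguishable, while at the new point one predicts roughly 20 and the other roughly 0; the generating relation, known here only because the data are constructed, gives 10. Nothing in the sample itself distinguishes between the fits, which is why the text turns to options (a) to (c).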

The next, and in many ways the most important, case is where we hope that there is a unique dependence of Y on some, or all, of the regressor variables that will remain stable over a range of conditions and we wish to estimate this relation and in particular to identify the regressor variables that occur in it. In a randomized experiment it may ideally be possible to estimate the contrasts of primary interest separately and efficiently, to show that they do not interact with external factors and that they account for most of the variability. Even here there are difficulties, particularly if the response surface is relatively complicated. For observational data, there are two major difficulties:
(a) the possibility of important omitted variables (see Section 2, point (ii));
(b) ambiguities arising from appreciable non-orthogonality of regressor variables.
There is discussion below of some of the devices that can be used to try to overcome (b).

In the situation contemplated in the previous paragraph, the objective is essentially the same as that in a randomized experiment. A more limited objective is to analyse preliminary data in order to suggest which factors would be worth including in a subsequent experiment and to suggest appropriate spacing for the levels. It would be interesting to examine the performance of some simple strategies, even though there will always be further information to be taken into account.

(ii) Aids to interpretation. In some cases the main interest may lie in the regression on $x_1$, the variable $x_2$ being included as characterizing say different groups of observations, or some potentially important aspect of secondary interest in the particular investigation. If $x_2$ can conveniently be grouped, it will often be good to fit separate regressions on $x_1$ within each $x_2$ group and then to relate the estimated parameters to $x_2$. This leads to an analysis of the stability of the regression equation and possibly to the construction of models containing interaction. More generally $x_1$ and $x_2$ may be sets of regressor variables.

If there is a property that is thought not to have an effect on Y, it will often be good to include it as a regressor variable.
A significant regression on this variable would then be a warning, for example of an important omitted variable.

The next set of remarks refer to ambiguities arising from non-orthogonality and all depend upon introducing further information in some form.
(a) It may be thought that the regression coefficient on say $x_1$ should be non-negative. In some special cases this may resolve an apparent ambiguity. For instance, suppose that $x_1$ and $x_2$ are closely positively related, that the combined regression is large, but the partial regressions are insignificant, that on $x_1$ being negative. Incidentally the attitude to assumptions such as that about the sign of a regression coefficient needs comment. That taken here is that any such assumption should, so far as possible, be tested on the data and, if consistent with the data, its consequences should be analysed and compared with the conclusions without the assumption. It might be argued from a Bayesian viewpoint that a prior probability should be attached to the assumption and a single conclusion obtained, but, even apart from the difficulty of doing this quantitatively in a meaningful way, it seems likely that the more cautious approach will be more informative.
(b) There may be sets of regressor variables which are to a large extent physically equivalent. For example, in a textile experiment yarn strength can be measured by several different methods. Quite often the measurements may be expected to be highly correlated and equivalent as regressor variables, although the data may show this expectation to be false. In applications like this it will be natural to try to use throughout one regressor variable, possibly a simple combination of the separate variables, provided that this does not give an appreciably worse fit than full fitting.
(c) Kendall (1957, p. 75) suggested applying principal component analysis to the regressor variables and then taking new regressor variables specified by the first few principal components. Jeffers (1967) and Spurrell (1963) have given interesting applications. A difficulty seems to be that there is no logical reason why the dependent variable should not be closely tied to the least important principal component. The following modification is worth considering. The principal components may suggest simple combinations of regressor variables with physical meaning. These simple combinations, not the principal components, can be used as regressor variables and if a good fit is obtained a constructive, although not necessarily unique, simplification has emerged. If the regressor variables can be divided into meaningful sets, e.g. into physical measurements and chemical measurements, separate principal component analyses could be considered for the two sets.
(d) In some situations, especially in the physical sciences, the method of dimensional analysis may lead to a reduction in the effective number of regressor variables.
(e) Another general way of clarifying the regression relation when some of the regressor variables are random variables is to examine plausible special models for the interrelationships between all the variables. There are two rather different cases. If the additional assumptions cannot be tested from the data, then parameters not previously estimable may become so, and those previously estimable may have the precision of estimation increased. On the other hand, if the additional assumptions can be tested, then the gain is confined to improved precision. Sewall Wright's method of path coefficients is essentially a device for handling complex systems of interrelations. For general discussion of path coefficients not specifically in genetic terms, see Tukey (1954), Turner and Stevens (1959), Turner et al. (1961) and, particularly for the connection with multiple regression, Kempthorne (1957, Chapter 14). The most familiar example of the second type of situation is the use of a concomitant variable to increase the precision of treatment contrasts in controlled experiments. When the concomitant variable is measured before the treatments are applied, the special model is justified by the randomization of treatments. Another simple example is the use of an intermediate variable (Cox, 1960). Here the regression of Y on $X_1$ is of interest and the supposition is that $X_2$ is a further variable such that, given $X_2 = x_2$, Y is independent of $X_1$. Then, under some circumstances, observation of $X_2$ can lead to appreciable increase in the precision of the estimated regression of Y on $X_1$. In other applications, analysis of covariance is used to see whether the data are in accord with the hypothesis that Y is affected by $X_1$ only via $X_2$.

(f) A very special case is when the regressor variables can be arranged in order of priority. The main cases are the fitting of polynomials and Fourier series.

(iii) Analysis of a set of fitted regressions. For problems in which many alternative equations, for example all $2^p$ linear regressions, are fitted to the same data, the handling of the resulting information needs comment. In a prediction problem in which the predictions are to be made over a set of x values distributed in much the same way as the data, an average variance of prediction, or better the corresponding standard deviation, will often be a reasonable measure of adequacy; of course, in some applications there may be particular points in x-space at which prediction is required. Those equations significantly worse than the overall fit can be identified in some way. Note that an equation significantly in conflict with the data may be used, for example because it involves substantial economy in the number of variables to be measured. This would be reasonable if the standard deviation of prediction is thought satisfactory, but the use of this equation in a new region of x-space is likely to be especially hazardous.

Where we are looking for a (hopefully) unique relation, the first step will often be to list all equations of a particular type that are not significantly contradicted by the data, as a preliminary to trying to narrow down the choice by some of the arguments sketched in (ii). Automatic devices for selecting equations would be used with great caution if at all. Gorman and Toman (1966) and Hocking and Leslie (1967) have discussed some further methods and in particular have outlined some of the recent unpublished work of Dr C. L. Mallows. A different approach is taken by Newton and Spurrell (1967a, b) who introduce quantities called elements to summarize the set of all $2^p$ regression sums of squares.

Particular caution is necessary in examining the effect of regressor variables which vary much less in the data than would be expected in future applications. The standard errors of the regression coefficients will be high and there is an obvious danger in judging the potential importance of such variables solely from the statistical significance of their regression coefficients.
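Fitting all $2^p$ linear regressions to the same data, as discussed in (iii), is straightforward to sketch for small p. The code below is a minimal illustration only: it solves the normal equations by Gaussian elimination and tabulates the residual sum of squares for every subset of regressors. It is not the algorithm of Beale et al. nor the "elements" of Newton and Spurrell; the function names and the tiny data set are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def residual_ss(columns, y):
    """Residual sum of squares of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def all_subsets(xcols, y):
    """Map each subset of regressor indices to its residual sum of squares."""
    p = len(xcols)
    table = {}
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            table[subset] = residual_ss([xcols[j] for j in subset], y)
    return table


# Hypothetical data with p = 3; y depends exactly on the first and third.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
y = [1.0 + 2.0 * u + w for u, w in zip(x1, x3)]

table = all_subsets([x1, x2, x3], y)
for subset in sorted(table, key=table.get):
    print(subset, round(table[subset], 6))
```

Sorting the table by residual sum of squares gives a first listing of equations not contradicted by the data; since $x_2$ nearly duplicates $x_1$ in this made-up example, equations using $x_2$ in place of $x_1$ also fit fairly well, which is the kind of ambiguity the text warns about.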

4. BIVARIATE POPULATIONS
Consider now situations in which the observations are pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ drawn from a bivariate population and in which there is no particular reason for studying the dependence of Y on X rather than that of X on Y. The example concerning heights and weights of schoolchildren discussed by Ehrenberg (1968) is an instance. Either or both regressions could legitimately be considered, but the question is whether it is fruitful to do so.

With one homogeneous set of data the concise description of the joint distribution is all that can be attempted, in the absence of a more specific objective. This may be done by a frequency table or by an estimate of the joint cumulative distribution function, or some parametric bivariate distribution can be fitted. While there has been discussion of special families of bivariate distributions other than the bivariate normal (Plackett, 1965; Moran, 1967) the bivariate normal distribution is nevertheless the one most likely to arise. Preliminary transformation may be desirable and one possibility is to consider transformations from (x, y) to

$$\left( \frac{x^{\lambda_1} - 1}{\lambda_1}, \; \frac{y^{\lambda_2} - 1}{\lambda_2} \right)$$

and to estimate $(\lambda_1, \lambda_2)$ by maximum likelihood (Box and Cox, 1964), assuming that on the transformed scale a bivariate normal distribution does apply. In some applications it may be reasonable to take $\lambda_1 = \lambda_2$.

If a bivariate normal distribution is fitted, estimates of five parameters are required and these might, for example, be the means $(\mu_x, \mu_y)$, the variances $(\sigma_x^2, \sigma_y^2)$ and the correlation coefficient $\rho$; see, however, Section 2, point (iv) for remarks on parametrization.

When there are k populations the problem will be to describe the set of populations in a concise way. There are many possibilities. Often separate descriptions will be attempted of (a) the means $(\mu_{xi}, \mu_{yi})$ $(i = 1, \ldots, k)$ and of (b) the parameters determining the covariance matrices. For (a) such questions will arise as whether the means lie on or around a line or curve and of whether their position can be linked with some other variable characterizing the populations. Ehrenberg's (1963) criticisms of regression applied to bivariate populations are partly directed at confusions of comparisons between populations with those within populations.

If the covariance matrices are not constant, it will be natural to look for aspects that are constant and these might include one or other regression coefficient, the ratio of the standard deviations, the correlation coefficient, etc. Any changes in covariance matrix may be linked with changes in mean.

Of course once a potentially reasonable representation is obtained, standard techniques, especially maximum likelihood, are available for fitting and for constructing significance tests. In many cases, however, the most challenging problem will be to discover the most fruitful concise representation among the many possibilities.

All the remarks of this section apply in principle to p variate problems.
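For a single homogeneous sample, the five-parameter description of a fitted bivariate normal distribution mentioned in this section can be computed directly. The sketch below is a minimal illustration on hypothetical data; the function name `bivariate_summary` and the dictionary keys are assumptions made here, and the variances use divisor n, which is the maximum likelihood form under normality. Both regression coefficients are returned as well, since either regression could legitimately be considered.

```python
import math


def bivariate_summary(x, y):
    """Maximum likelihood estimates for a bivariate normal sample."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / n
    syy = sum((yi - my) ** 2 for yi in y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx,
        "var_y": syy,
        "rho": sxy / math.sqrt(sxx * syy),
        "beta_yx": sxy / sxx,   # regression coefficient of Y on X
        "beta_xy": sxy / syy,   # regression coefficient of X on Y
    }


# Hypothetical sample of five pairs.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
summary = bivariate_summary(x, y)
print(summary)
```

The product of the two regression coefficients equals the squared correlation, so once the means, variances and correlation are reported the two regression lines carry no further information about the fitted distribution.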

5. MODELS WITH COMPONENTS OF VARIATION
In the mathematical theory of regression the most awkward problems are probably those in which the observations are split into components not directly observable, and the relationships between these components are to be explored. There is a very extensive theoretical literature on such situations; see, in particular, Lindley (1947), Madansky (1959), Tukey (1951), Sprent (1966), Fisk (1967), Kendall and Stuart (1967) and Nelder (1968). In this section a few comments on such systems will be made, particularly on points which connect with the previous discussion.

The simplest situation is where only the dependent variable is split into components, a hypothetical true value and a measurement or sampling error. The main question, easily answered, is then to assess how much of the observed dispersion of Y about its regression on x is accounted for by the measurement or sampling error. For example Y might be the square root of a Poisson distributed variable, when the sampling error has variance nearly 1/4. One would, in particular, want to know whether this accounted for all the random variation present. More difficult problems would arise if it were required to estimate the distributional form of the "hidden" component of random variation.

The more interesting cases are where both dependent and regressor variables can be split into components:

$$X_i = \Phi_i + \varepsilon_i, \qquad Y_i = \Psi_i + \eta_i, \qquad \Psi_i = \alpha + \beta \Phi_i + \xi_i.$$

Here $\varepsilon_i$, $\eta_i$ are measurement or sampling errors of zero mean and $\xi_i$ is a deviation from the regression line, again of zero mean. The simplest case is where $\Phi_i$, the "true" value of the regressor variable, is a random variable. Random variables for different i are assumed independent and the triple $(\varepsilon_i, \eta_i, \xi_i)$ is assumed independent of $\Phi_i$. Various cases may arise for the covariance matrix of the triple, the simplest being that the three components are mutually independent. Fisk (1967) and Nelder (1968) have considered models in which the regression coefficient is a random variable.

Sometimes it is convenient to write $\beta_{\Psi\Phi}$ instead of $\beta$ to distinguish it from $\beta_{YX}$, the population least squares regression of Y on X. In fact

$$\beta_{\Psi\Phi} = \frac{\mathrm{var}(X)\, \beta_{YX}}{\mathrm{var}(X) - \mathrm{var}(\varepsilon)}.$$

If prediction of Y directly from X is the objective, $\beta_{YX}$ is required, not $\beta_{\Psi\Phi}$, so long as X is a random variable; if, however, the future X's at which prediction is to be attempted are not random, or come from a different distribution, the presence of the components $\varepsilon$ does need consideration.

Much published discussion concentrates on the estimation of $\beta$ and in particular on the circumstances under which $\beta$ is consistently estimable; for some purposes it is enough to note that $\beta$ is between $\beta_{YX}$ and $1/\beta_{XY}$ (Moran, 1956). The simplest case is when $\mathrm{var}(\varepsilon)$ can be estimated from separate data, for instance from within replicate variation, or theoretically. Quite often the correction factor, $\beta_{\Psi\Phi}/\beta_{YX}$, is very near one.

Some further problems arise naturally and some, but not all, can be answered in a fairly direct way. In most cases separate estimates of at least part of the covariance matrix of $(\varepsilon, \eta)$ are required. If more than the minimum amount of information is available a more searching test of the model is possible. The following illustrate the further problems:
(i) Estimate the three components of variance of Y, namely $\beta^2 \mathrm{var}(\Phi)$, $\mathrm{var}(\xi)$ and $\mathrm{var}(\eta)$.
(ii) In particular, are the data consistent with all the $\xi$'s being zero?
(iii) Is a discrepancy between the estimated regression coefficient of Y on X and a theoretical value explicable in terms of "errors" in the regressor variable?
(iv) Are apparent differences between groups in the regression coefficients of Y on X explicable in terms of "errors" in the regressor variable?

    (v) In thecontext fSection , (Xi,Yj)mayrefero the amplemeans f theithgroup, D, T) being hecorrespondingopulationmeans. The covariancematrix f e, 1) canbe estimated: hat an be said aboutthe relation etween'Ti and (Di?(vi) Howmuchmore ffectivelyouldY bepredictedrom ifX weremeasuredmore recisely,or xample y dditionaleplication?(vii) s non-linearityn theregressionf Y on X explicableyerrorsn theregressorariable?Innon-normalases,fT has inear egressionn D.Ywillnothave inear egressionnX.)When here s more hanoneregressorariable imilar roblemsrise. Theimportantechniquesased on instrumentalariables illnot be consideredere;see,however,ection ,pointii).6. MISCELLANEOUS POINTSThisfinal ection ealswith numberf miscellaneousopicsnotdiscussedearlier.
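    Before the individual points, a brief numerical look back at the errors-in-regressor correction of Section 5 may help. The following is only a minimal sketch, not part of the original paper: it assumes mutually independent normal components, assumes the error variance of the regressor is known from separate data, and uses made-up parameter values. The function names (`simulate`, `corrected_slope`) are hypothetical. It compares the least squares slope of Y on X with the slope obtained by multiplying it by var(X) / {var(X) - var(error in X)}.

```python
import random


def variance(values):
    """Sample variance with divisor n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def corrected_slope(xs, ys, var_eps):
    """Return (least squares slope of Y on X, slope corrected for error in X).

    var_eps is the variance of the measurement error in X, assumed known
    from separate data (an assumption of this sketch).
    """
    var_x = variance(xs)
    beta_yx = covariance(xs, ys) / var_x
    return beta_yx, beta_yx * var_x / (var_x - var_eps)


def simulate(n, alpha, beta, sd_true, sd_eps, sd_eta, sd_xi, seed=0):
    """Simulate observed (X, Y) pairs from the component model of Section 5."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        true_x = rng.gauss(0.0, sd_true)                       # "true" regressor value
        true_y = alpha + beta * true_x + rng.gauss(0.0, sd_xi)  # deviation from the line
        xs.append(true_x + rng.gauss(0.0, sd_eps))              # error in X
        ys.append(true_y + rng.gauss(0.0, sd_eta))              # error in Y
    return xs, ys


# Illustrative values only: true slope 2, error variance in X equal to 0.25.
xs, ys = simulate(n=20000, alpha=1.0, beta=2.0,
                  sd_true=1.0, sd_eps=0.5, sd_eta=0.3, sd_xi=0.2)
beta_yx, beta_corrected = corrected_slope(xs, ys, var_eps=0.25)
print(beta_yx, beta_corrected)
```

    With these illustrative settings the uncorrected slope is attenuated towards 2 x 1/1.25 = 1.6 in the population, while the corrected value should lie close to the true slope of 2. If var(error in X) is small relative to var(X) the two are nearly equal, which is the situation in which the correction factor is "very near one".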


    (i) Graphical methods. These are very important both for the direct plotting of scatter diagrams of pairs of variables, possibly distinguishing other variables by a coarse grouping, but also for the systematic plotting of residuals; see, for example, Anscombe (1961). Particularly with extensive data, the systematic plotting of residuals is likely to be the most searching way of testing and improving models. It is possible that developments in computer display devices will lead to valuable ways of inspecting relationships involving more than two variables.

    (ii) Outliers and robust estimation. The screening of data for suspect observations will often be required. With limited data it will be usual to look at suspect values individually in order to decide whether to include them in any subsequent analysis; often analyses with and without suspect values will be needed. With p observations for each individual the best way of looking for outliers will depend on the type of effect expected. Thus

    (a) if any extreme deviation is thought to be in a particular known variable, usually the dependent variable, residuals from its regression on the other variables should be examined. For further discussion, see Mickey et al. (1967);
    (b) suppose that any extreme deviation is thought to be confined to one variable, but not necessarily the same variable for different individuals. This might be the case, for example, with occasional gross recording errors. One procedure is then to calculate p residuals for each individual, one for each variable regressed on all the others;
    (c) if any individual may be subject to extreme deviations in one or more variables simultaneously, and the joint distribution is approximately p variate normal, it may be reasonable to calculate for the ith individual, with vector observation $Y_i$, a standardized squared distance from the mean $\bar{Y}$, given by $D_i = (Y_i - \bar{Y})' S^{-1} (Y_i - \bar{Y})$, where S is the estimated covariance matrix. Then the ordered $D_i$'s can be plotted against the expected order statistics for samples from the chi-squared distribution with p degrees of freedom. Iteration of the procedure may be desirable. Wilk and Gnanadesikan (1964) have given a general discussion of graphical methods for multiresponse experiments.

    With extensive data, however, it may be necessary to use methods of analysis that are insensitive to outliers, so-called methods of robust estimation; see, for example, Huber (1964).

    (iii) Missing values. Afifi and Elashoff (1966, 1967) have reviewed the literature on missing values in multivariate data and have considered in some detail point estimation in simple linear regression. Univariate missing value theory concentrates on the computational aspects of exploiting the near-balance of a balanced design spoiled by a missing observation, but no information is contributed by the missing observations. In a multivariate case, however, information may be contributed by individuals for which some component observations are missing. In a multiple regression problem, there is usually no information from individuals in which a regressor variable is missing, unless that variable can be regarded as random. An exception is when there is, say, an individual with $x_1$ missing and analysis of the other individuals suggests the omission of $x_1$ from the regression equation. Suppose, however, that a regressor variable is random, and the individuals with that variable missing can be regarded as selected randomly, a quite severe assumption, which should be tested where possible. Then more can be done. In some applications nearly all individuals may have at least one missing component and then use of some missing value theory is essential. Roughly speaking, the covariance between any two random variables can be estimated from those individuals in which both variables


    are available; there seems scope for further work to settle just when it is wise to do this and when something more elaborate such as full maximum likelihood estimation is desirable.

    (iv) Non-normal variation. The present paper is largely concerned with problems to which least squares methods are reasonably applicable, possibly after transformation. In regression-like problems in which particular non-normal distributions can be specified, we have usually to apply maximum likelihood methods. These are locally equivalent to least squares techniques and therefore a great deal of the above discussion, for example that on the choice of regressor variables, is immediately relevant. Anscombe (1967) considered in some detail the analysis of a linear model with non-normal distribution of error; Cox and Hinkley (1968) found the asymptotic efficiency of least squares estimates in such situations.
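    The remark that maximum likelihood is locally equivalent to least squares can be made concrete with a small sketch. The example below is an illustration under stated assumptions rather than anything taken from the paper: it assumes a Poisson response whose log mean is linear in a single regressor, and it fits the model by iteratively reweighted least squares, so that each step is an ordinary weighted least squares fit to a "working" response. The data and the function names (`weighted_least_squares`, `poisson_loglinear_fit`) are made up for the illustration.

```python
import math


def weighted_least_squares(xs, zs, ws):
    """Weighted least squares fit of z = a + b * x; returns (a, b)."""
    total_w = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total_w
    z_bar = sum(w * z for w, z in zip(ws, zs)) / total_w
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxz = sum(w * (x - x_bar) * (z - z_bar) for w, x, z in zip(ws, xs, zs))
    slope = sxz / sxx
    return z_bar - slope * x_bar, slope


def poisson_loglinear_fit(xs, ys, iterations=25):
    """Maximum likelihood for log(mean) = b0 + b1 * x by repeated weighted least squares."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0  # start from a constant mean
    for _ in range(iterations):
        etas = [b0 + b1 * x for x in xs]          # current linear predictor
        mus = [math.exp(eta) for eta in etas]     # current fitted means
        # Working response: linear predictor plus scaled residual.
        zs = [eta + (y - mu) / mu for eta, y, mu in zip(etas, ys, mus)]
        # For the Poisson log-linear model the working weights are the fitted means.
        b0, b1 = weighted_least_squares(xs, zs, mus)
    return b0, b1


# Made-up counts, for illustration only.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 2, 6, 9, 14]
b0, b1 = poisson_loglinear_fit(xs, ys)
fitted = [math.exp(b0 + b1 * x) for x in xs]
print(b0, b1)
```

    At convergence the fitted means satisfy the likelihood equations, that is, the residuals y - fitted sum to zero both unweighted and weighted by x, which are the same orthogonality conditions as in a least squares fit.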

    The justification of maximum likelihood methods is asymptotic but sometimes analogues of at least a few of the "exact" properties of normal-theory linear models can be obtained. The simplest case is when the ith observation on the dependent variable has a distribution in the exponential family (Lehmann, 1959, p. 50)

    $$\exp\{A(y)B(\theta_i) + C(y) + D(\theta_i)\},$$

    where $\theta_i$ is a single parameter and there is a linear model

    $$B(\theta_i) = \sum_r x_{ir}\beta_r,$$

    where the $\beta$'s are unknown parameters and the x's known constants. Special cases are the binomial, Poisson and gamma distributions when the "linear" model applies to the logit transform, to the log of the Poisson mean and to the reciprocal of the mean of the gamma distribution. Sufficient statistics are obtained and in very fortunate cases useful "exact" significance tests for single regression coefficients emerge.

    (v) Experimental and observational data. Many of the issues discussed in the paper apply less acutely to the analysis of controlled experiments than to the analysis of observational data and that is why the paper may seem overweighted towards the latter type of problem. In fact, in terms of the discussion in this paper, there are three rather different reasons why fewer difficulties arise in the analysis of experimental data, quite apart from the smaller random error to which such data are likely to be subject. These reasons are:

    (1) the spacing of regressor variables is likely to be more suitable;
    (2) substantial non-orthogonalities of estimation will be avoided;
    (3) factors omitted from the treatments will be randomized and hence the worst difficulties associated with omitted variables (Section 2, point (ii)) will be avoided.
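    Returning to the outlier screening of point (ii)(c), the standardized squared distance and its chi-squared plot can be sketched as follows. This is a minimal illustration only: it is restricted to p = 2 so that the covariance matrix can be inverted by hand and the chi-squared quantiles have the closed form -2 log(1 - u); it approximates the expected order statistics by quantiles at the plotting positions (k - 1/2)/n, which is an assumption of the sketch; and the data, including one deliberately discordant point, are made up. The function names are hypothetical.

```python
import math


def squared_distances(points):
    """D_i = (Y_i - mean)' S^{-1} (Y_i - mean) for bivariate points, S with divisor n - 1."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    det = s11 * s22 - s12 ** 2
    distances = []
    for a, b in points:
        u, v = a - m1, b - m2
        # Quadratic form with the explicit inverse of a 2 x 2 covariance matrix.
        distances.append((s22 * u * u - 2.0 * s12 * u * v + s11 * v * v) / det)
    return distances


def chi2_2df_quantiles(n):
    """Approximate expected order statistics of chi-squared with 2 degrees of freedom."""
    return [-2.0 * math.log(1.0 - (k - 0.5) / n) for k in range(1, n + 1)]


# Made-up data: seven points near a line and one discordant point (the last).
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1),
          (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (3.5, 12.0)]
distances = squared_distances(points)
quantiles = chi2_2df_quantiles(len(points))
# Pairs (theoretical quantile, ordered distance) to be plotted against each other.
plot_pairs = list(zip(quantiles, sorted(distances)))
for q, d in plot_pairs:
    print(round(q, 3), round(d, 3))
```

    Points falling well above the line through the bulk of the plot are candidates for individual inspection. Since an outlier inflates the estimated covariance matrix and so masks itself to some extent, repeating the calculation after setting aside a suspect individual corresponds to the iteration mentioned in the text.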

    ACKNOWLEDGEMENT

    I am grateful to Mrs E. J. Snell and to the referees for constructive comments.

    REFERENCES

    AFIFI, A. A. and ELASHOFF, R. M. (1966). Missing observations in multivariate statistics. I. Review of the literature. J. Am. Statist. Ass., 61, 595-604.
    - (1967). Missing observations in multivariate statistics. II. Point estimation in simple linear regression. J. Am. Statist. Ass., 62, 10-29.
    ANSCOMBE, . J. (1961). Examination of residuals. Proc. 4th Berkeley Symp., , 1-36.
    - (1967). Topics in the investigation of linear relations fitted by the method of least squares. J. R. Statist. Soc. B, 29, 1-52.


    BEALE, E. M. L., KENDALL, M. G. and MANN, D. W. (1967). The discarding of variables in multivariate analysis. Biometrika, 4, 357-366.
    BOX, G. E. P. (1966). Use and abuse of regression. Technometrics, , 625-630.
    BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations. J. R. Statist. Soc. B, 26, 211-252.
    BOX, G. E. P. and TIDWELL, P. W. (1962). Transformation of the independent variables. Technometrics, , 531-550.
    COX, D. R. (1960). Regression analysis when there is prior information about supplementary variables. J. R. Statist. Soc. B, 22, 172-176.
    - (1961). Tests of separate families of hypotheses. Proc. 4th Berkeley Symp., , 105-123.
    - (1962). Further results on tests of separate families of hypotheses. J. R. Statist. Soc. B, 24, 406-424.
    COX, D. R. and HINKLEY, D. V. (1968). A note on the efficiency of least squares estimates. J. R. Statist. Soc. B, 30, 284-289.
    DRAPER, N. R. and SMITH, . (1966). Applied Regression Analysis. New York: Wiley.
    EHRENBERG, . S. C. (1963). Bivariate regression is useless. Appl. Statistics, 2, 161-179.
    - (1968). The elements of law-like relationships. J. R. Statist. Soc. A, 131, 280-302.
    FISHER, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd.
    FISK, P. (1967). Models of the second kind in regression analysis. J. R. Statist. Soc. B, 29, 266-281.
    GORMAN, J. W. and TOMAN, . J. (1966). Selection of variables for fitting equations to data. Technometrics, , 27-51.
    HOCKING, R. R. and LESLIE, R. N. (1967). Selection of the best subset in regression analysis. Technometrics, , 531-540.
    HUBER, P. J. (1964). Robust estimation of location. Ann. Math. Statist., 5, 73-101.
    JEFFERS, . N. R. (1967). Two case studies in the application of principal component analysis. Applied Statistics, 6, 225-236.
    KEMPTHORNE, . (1957). An Introduction to Genetic Statistics. New York: Wiley.
    KENDALL, M. G. (1957). A Course in Multivariate Analysis. London: Griffin.
    KENDALL, M. G. and STUART, A. (1967). Advanced Theory of Statistics (2nd ed.), Vol. 2. London: Griffin.
    LEHMANN, . L. (1959). Testing Statistical Hypotheses. New York: Wiley.
    LINDLEY, D. V. (1947). Regression lines and linear functional relationships. J. R. Statist. Soc. B, 9, 218-244.

    - (1968). The choice of variables in multiple regression. J. R. Statist. Soc. B, 30, 31-66.
    MADANSKY, A. (1959). The fitting of straight lines when both variables are subject to error. J. Am. Statist. Ass., 54, 173-205.
    MICKEY, M. R., DUNN, O. J. and CLARK, V. (1967). Note on the use of stepwise regression in detecting outliers. Comp. and Biomed. Res., 1, 105-111.
    MORAN, P. A. P. (1956). A test of significance for an unidentified relation. J. R. Statist. Soc. B, 18, 61-64.
    - (1967). Testing for correlation between non-negative variates. Biometrika, 4, 385-394.
    NELDER, J. A. (1968). Regression, model-building and invariance. J. R. Statist. Soc. A, 131, 303-315.
    NEWTON, R. G. and SPURRELL, D. J. (1967a). A development of multiple regression for the analysis of routine data. Applied Statistics, 16, 51-64.

    - (1967b). Examples of the use of elements for clarifying regression analysis. Applied Statistics, 6, 165-172.
    PLACKETT, R. L. (1960). Regression Analysis. Oxford: Clarendon Press.
    - (1965). A class of bivariate distributions. J. Am. Statist. Ass., 60, 516-522.
    RAO, C. R. (1965). Linear Statistical Inference and its Applications. New York: Wiley.
    SPRENT, P. (1966). A generalized least-squares approach to linear functional relationships. J. R. Statist. Soc. B, 28, 278-297.
    SPURRELL, D. J. (1963). Some metallurgical applications of principal components. Applied Statistics, 2, 180-188.
    TUKEY, . W. (1951). Components in regression. Biometrics, , 33-69.
    - (1954). Causation, regression and path analysis. In Statistics and Mathematics in Biology (ed. O. Kempthorne). Iowa: Ames.
    TURNER, M. E., MONROE, R. J. and LUCAS, H. L. (1961). Generalized asymptotic regression and non-linear path analysis. Biometrics, 7, 120-143.


    TURNER, M. E. and STEVENS, . D. (1959). The regression analysis of causal paths. Biometrics, 15, 236-258.
    WILK, M. B. and GNANADESIKAN, . (1964). Graphical methods for internal comparisons in multiresponse experiments. Ann. Math. Statist., 35, 613-631.
    WILLIAMS, . J. (1959). Regression Analysis. New York: Wiley.
    YATES, F. (1939). Tests of significance of the differences between regression coefficients derived from two sets of correlated variates. Proc. R. Soc. Edinb., 9, 184-194.