


FISHER'S METHOD OF SCORING

M.R. Osborne
Statistics Research Section
Mathematical Sciences School
Australian National University

Abstract. An analysis is given of the computational properties of Fisher's method of scoring for maximizing likelihoods and solving estimating equations based on quasi-likelihoods. Consistent estimation of the true parameter vector is shown to be important if a fast rate of convergence is to be achieved, but if this condition is met then the algorithm is very attractive. This link between the performance of the scoring algorithm and the adequacy of the underlying problem modelling is stressed. The effect of linear constraints on performance is discussed, and examples of likelihood and quasi-likelihood calculations are presented.

Key words and phrases. Maximum likelihood, Newton's method, law of large numbers, consistency, quasi-likelihood, method of scoring, line search, rate of convergence, bound constraints, multinomial likelihood.

1. Introduction

Two basic paradigms play important roles in the material developed in this paper. These are:

(1) Newton's method for function minimization, and
(2) the method of maximum likelihood for parameter estimation in data analysis problems.

The main aim is to examine aspects of the structure and performance of Fisher's method of scoring, a minimization technique based on the first paradigm, which provides a general approach to minimizing objective functions formulated using estimation strategies based on the second. The context is one in which the structure of the objective function depends on assumptions of a stochastic nature made about the problem, and it will be important to show how these affect the performance of the minimization algorithm.

The main assumptions made about the estimation problem are:

(1) The data set on which the estimation problem is based consists of event outcomes $y_t \in R^m$, $t \in T$, where $T$ contains the labelling information which identifies the particular events. Typically, $y_t$ is a vector of measurements on a system, while $T$ could be an index set, the times at which particular observations are made, or a set of coordinates describing the possible configurations of a piece of apparatus.
Events associated with distinct elements of $T$ are assumed to be independent. The restriction to a single event type is made for simplicity.

(2) A quantity $N$, the effective sample size, is associated with the data set. It is assumed there is a systematic procedure which associates the particular data set with a realization of one of a family of experiments (each characterised by a distinct value of $N$) in such a way that the limiting behaviour as $N \to \infty$ can be described. Let $T_N$ label the particular event set. Then `systematic' means that there is a measurable set $E \subseteq R^k$, $T_N \subset E$, and an increasing weight function $\omega(u)$ defined on $E$ such that

$$\lim_{N\to\infty} \frac{1}{N} \sum_{t\in T_N} \phi(t) = \int_{[0,1]^k} \phi(t(u))\, d\omega(u) \qquad (1.1)$$

where $\phi$ is any continuous functional defined on $E$ and $t(u)$ maps $[0,1]^k \to E$. For example, consider $N$ observations taken at equispaced intervals of time such that $t(\frac{i}{N}) = \frac{i}{N}$, $\omega(u) = u$, and $E = [0,1]$ (a numerical sketch of this mechanism is given after this list). Note that just the mechanism of sampling is being discussed here. That the mechanism is appropriate, so that information in the system increases without limit as $N \to \infty$, involves further assumptions.

(3) The `true' probability density (mass for discrete distributions) associated with the event $y_t$ is $g(y_t; \eta(t, \beta^*), t)$. Here $\beta^* \in R^p$ is a vector of parameters which is to be estimated from the given data, and $\eta(t, \beta): R^p \times T \to R^q$ expresses the dependence on parametric and covariate data. That is, $\eta$ provides a model for the underlying data. It will always be assumed that quantities are smooth enough and that $\nabla_\beta \eta$ has its maximum rank (this serves to get obvious regularity conditions out of the way).

(4) The negative of the log likelihood function used for the estimation of $\beta$ is

$$K(\beta) = \sum_{t\in T} -L_t, \qquad L_t = \log(f(y_t; \eta(t,\beta), t)) \qquad (1.2)$$

where $f$ is the density hypothesised for the event $y_t$, and may not be the same as $g$ (but it will have similar smoothness properties). Note that the same $\eta$ appears in both $f$ and $g$, so that the belief that a `true model' belongs to a particular parametric class of possible models is not questioned explicitly. However, such cases as $\beta^* = \beta_1^* \times \beta_2^*$, $\eta = \beta_1 \geq 0$ are not excluded.
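As a concrete check of this sampling mechanism, the following minimal sketch (the code and the choice of $\phi$ are illustrative additions, not part of the original development) evaluates both sides of (1.1) for the equispaced example with $k = 1$:

```python
import numpy as np

# Sampling mechanism of assumption (2) with k = 1: t(i/N) = i/N,
# omega(u) = u, E = [0,1].  The normalised sum in (1.1) should approach
# the integral of phi d omega.  phi here is an arbitrary illustrative choice.
phi = lambda t: t ** 2                  # any continuous functional on E

for N in (10, 100, 1000, 10000):
    t = np.arange(1, N + 1) / N         # the event labels T_N
    avg = phi(t).sum() / N              # left hand side of (1.1)
    print(N, avg, abs(avg - 1 / 3))     # integral of t^2 dt on [0,1] is 1/3
```

The discrepancy decays like $1/(2N)$, so the limiting operation in (1.1) is already well approximated at moderate effective sample sizes.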


Remark 1.1. Our development of scoring is essentially from the same point of view as that presented in [1], [11], [12] in considering generalised linear models and the statistical language GLIM. In this setting the implicit assumption is made that the appropriate limiting behaviour corresponds to $m/N \to 0$ as $N \to \infty$. Together with the assumption of the independence of the events $y_t$, this makes practical an important connection between the basic scoring step (4.1) and a linear least squares problem (4.19). This provides an effective method for implementing the scoring algorithm with distinct numerical advantages in terms of scaling and stability. It also puts an emphasis on the form of the sampling strategy defined by $T_N$, and in subsequent argument involving appeals to the law of large numbers it will be assumed that appropriate regularity conditions are satisfied. A minimum condition on $T$ is that it serves to identify $\beta$ in the case that $y_t$, $t \in T$, is given without error.

Example 1.1. One important model for the event data $y_t$ is that of a signal observed in the presence of noise. In this case it is assumed that there is a parametric model for the signal given by

$$E_{g(t,\beta^*)}\{y_t\} = \mu(t, \beta^*) \qquad (1.3)$$

where $E$ is the expectation operator (for the density specified), and the mean vector $\mu(t,\beta)$, $\beta \in B$, describes the parametric family of possible signals. Typically, $\eta = \mu$ in (1.2).

The purpose of this paper is to show not only that scoring provides a very satisfactory computing algorithm, but also that the performance of the algorithm provides important insights into the underlying problem modelling. This is done by establishing a connection between the two questions of the convergence of the computed estimates $\hat\beta_N$ as $N \to \infty$, and the asymptotic rate of convergence of the scoring algorithm. This link is provided by Newton's method.

The plan of the paper is as follows. The necessary properties of Newton's method for function minimization are sketched in the next section. It is used to show consistency of maximum likelihood and quasi-likelihood estimates in section 3. The case for scoring is presented in section 4. The argument in [16], which shows that the natural extension of the Levenberg algorithm for nonlinear least squares to scoring possesses both good global convergence properties and a fast rate of convergence in the case of likelihoods based on the exponential family of distributions provided the problem data is adequate, extends to the more general case (1.2) with only minor modification. For this reason, discussion is concentrated on the numerical properties of quasi-likelihood based estimation procedures. These have the novel property that it is the derivative of the objective function which is directly available rather than the function values themselves, and this consideration affects the numerical detail of the method.
The emphasis on the rate of convergence of the scoring method complements the discussion of quasi-likelihoods in [11] and exponential dispersion models in [8]. It does this by stressing the nexus between efficient performance of the scoring algorithm and adequacy of the problem modelling. In other words, if scoring does not work well then the results are likely not of interest. The treatment of linear constraints on $\beta$ is considered in section 5, and some numerical results are presented in section 6.

2. Newton's method

In this section properties of Newton's method for function minimization are summarised as a preliminary to a comparison with Fisher's method of scoring. The method seeks to estimate a stationary point of the log likelihood by making a quadratic approximation to the objective function at the current point. Let

$$\mathcal{J} = \nabla^2 K(\beta); \qquad (2.1)$$

then a step of the iteration is given by $\beta \leftarrow \beta + h$ where

$$h = -\mathcal{J}(\beta)^{-1} \nabla K(\beta)^T. \qquad (2.2)$$

Also, it is usual to stabilize Newton's method by making a linesearch step using a merit function $Q(\beta)$ which has the same minimum as $K(\beta)$, and which satisfies the descent condition

$$\nabla Q(\beta)\, h < 0 \qquad (2.3)$$

whenever $h$ and $\nabla Q(\beta) \neq 0$. The idea is that the step actually achieved in $\beta$ should reduce $Q$ and also satisfy some further criterion of effectiveness. The criterion discussed in [16] will serve the purpose here. It sets

$$\psi(\beta; h, \lambda) = \frac{Q(\beta + \lambda h) - Q(\beta)}{\lambda\, \nabla Q(\beta)\, h} \qquad (2.4)$$

and updates $\beta$ by

$$\beta \leftarrow \beta + \lambda h \qquad (2.5)$$

where $\lambda$ is chosen to satisfy the conditions

$$0 < \sigma < \psi(\beta; h, \lambda) < 1 - \sigma, \qquad (2.6)$$

where typically $\sigma$ is chosen small (say $10^{-4}$).
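The pieces (2.2)-(2.6) assemble into the following minimal sketch (the convex test objective and the simple halving search are illustrative assumptions, not the implementation of [16]); the monitor is taken as $Q = K$:

```python
import numpy as np

def psi(Q, gradQ, beta, h, lam):
    # the criterion function of (2.4): actual decrease over linear prediction
    return (Q(beta + lam * h) - Q(beta)) / (lam * (gradQ(beta) @ h))

def damped_newton(K, grad, hess, beta, sigma=1e-4, tol=1e-10):
    while np.linalg.norm(grad(beta)) > tol:
        h = -np.linalg.solve(hess(beta), grad(beta))   # Newton step (2.2)
        lam = 1.0
        # psi -> 1 as lam -> 0, so halving terminates whenever h is downhill
        while psi(K, grad, beta, h, lam) < sigma:
            lam /= 2.0
        beta = beta + lam * h                          # update (2.5)
    return beta

# example: K(b) = cosh(b1) + cosh(b2), minimum at the origin
K = lambda b: np.cosh(b).sum()
g = lambda b: np.sinh(b)
H = lambda b: np.diag(np.cosh(b))
print(damped_newton(K, g, H, np.array([2.0, -1.5])))
```

Only the lower test $\sigma < \psi$ of (2.6) is enforced in this sketch; the upper test, which guards against unnecessarily small steps, is customarily waived at the full step $\lambda = 1$.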


Newton's method has the particular advantages:

(1) It has a fast (second order) rate of ultimate convergence to a zero $\hat\beta$ of $\nabla K$ provided $\mathcal{J}$ is nonsingular (but this does not imply convergence to a minimum of $K$).
(2) It is invariant under constant, nonsingular, linear transformations. Close enough to the solution it is invariant to first order under general, nonsingular transformations.

However, there are also potential problems with Newton's method:

(1) Selection of a suitable monitor may not be straightforward. For example, it is not necessarily true that $K$ is reduced in making a small enough step in the direction determined by $h$ (a sufficient condition is that $\nabla^2 K$ be positive definite). Thus the simple tactic of setting $Q = K$ in (2.4) may not be available for the Newton correction. Also, while $h$ is downhill in the sense of (2.3) for minimizing $Q = \|\nabla K\|^2$, which is minimized at all stationary points of $K$, the use of this modified objective function can introduce further difficulties (the scaling may not be appropriate [2]). A possibility that needs to be explored further is that negative curvature in $K$ is informative, and that the appropriate action needs to take this into account.
(2) Calculation of $h$ requires a knowledge of second derivatives of $K$. But often it is believed to be either uneconomical or inconvenient to compute these. For this reason it is conventional wisdom to seek methods which use at most first derivatives.

Newton's method has an advantage of a different kind in that it can be used to prove existence theorems. Consider, for example, the iteration started from an initial point $\beta^0$ contained within some ball $B = \{\beta; \|\beta - \beta^0\| < \rho\}$ in which something is known (or is going to be assumed) about the regularity properties of the problem data. If

(1) $K$ is twice continuously differentiable, and $\|\mathcal{J}(\beta) - \mathcal{J}(\beta')\| \leq K_1\|\beta - \beta'\|$ for all $\beta, \beta' \in B$,
(2) $\|\mathcal{J}(\beta^0)^{-1}\| \leq K_2$, $\|\mathcal{J}(\beta^0)^{-1}\nabla K(\beta^0)^T\| \leq K_3$,
(3) $\gamma = K_1 K_2 K_3$ satisfies $\gamma < \frac12$,
(4) $\rho > \rho^* = \frac{1}{\gamma}\big(1 - \sqrt{1 - 2\gamma}\big) K_3$,

then the full step Newton's method converges to a point $\hat\beta \in B$, and $\hat\beta$ is the only root of $\nabla K(\beta) = 0$ in $B$ provided $\rho$ is near enough to $\rho^*$.
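For a scalar illustration (the example is ours, not the paper's): take $K(\beta) = \cosh\beta$, so $\nabla K = \sinh\beta$, $\mathcal{J} = \cosh\beta$, and the unique root of $\nabla K$ is $\beta = 0$. The constants of the theorem can be checked directly:

```python
import numpy as np

beta0, rho = 0.1, 0.5
K2 = 1 / np.cosh(beta0)                    # bound on J(beta0)^{-1}
K3 = abs(np.tanh(beta0))                   # J(beta0)^{-1} grad K(beta0)
K1 = np.sinh(beta0 + rho)                  # Lipschitz constant of J on B
gamma = K1 * K2 * K3
rho_star = (1 - np.sqrt(1 - 2 * gamma)) * K3 / gamma
print(gamma < 0.5, rho > rho_star)         # conditions (3) and (4) hold

b = beta0                                  # full-step Newton iteration
for _ in range(6):
    b -= np.tanh(b)                        # h = -sinh(b)/cosh(b)
print(b)                                   # converges to the root at 0
```

Here $\gamma \approx 0.06$ and $\rho^* \approx 0.10$, and the computed root lies within the ball of radius $2K_3 \approx 0.2$ about $\beta^0$, in line with Remark 2.1 below.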


Remark 2.1. Note that $\rho^*$ can be chosen small if $K_3$ is small and the other assumptions hold with moderate values for the constants. Thus, if $\beta^0$ gives a small value of $K_3$ then $\hat\beta$ is close to $\beta^0$ (certainly inside a ball of radius $2K_3$). An application of this result is made in the next section in discussing the convergence of the minima of $K_N$ as $N \to \infty$. The idea is to use the stochastic properties to identify a limiting value $\beta_L$. This is then used as $\beta^0$ in showing that $\gamma_N$ and $\rho^*_N$ are small provided $N$ is large enough.

3. Consistency

Consistency has to do with the question of how the parameter estimates $\hat\beta_N$ obtained by minimizing (1.2) behave as $N \to \infty$. Newton's method finds stationary points, so the assumption of a minimizing sequence involves a further condition. Here it is essential because the standard consistency results for maximum likelihood are results about global minima. A further question concerns the manner in which convergence behaviour is affected if the true probability density of the events $y_t$ is unknown and guessed incorrectly; and this leads in turn to a third question which asks what kind of procedure can be employed to ensure that correct limiting behaviour of the computed estimates is not too critically dependent on guessing the right probability structure.

It is convenient to consider these questions here:

(1) because the convergence result for Newton's method permits a simple discussion of consistency (but one limited by quite strong smoothness requirements). A discussion under weaker conditions is given in Huber [6];
(2) because this approach provides the results in a form which permits them to be related directly to the numerical performance of the scoring method.

Remark 3.1. To discuss consistency it is necessary to identify first an appropriate limiting value $\beta_L$ (assumed to exist). We associate the two ideas $\hat\beta_N \to \beta_L$ in the sense of almost sure convergence, and $\hat\beta_N$ is a consistent estimator of $\beta_L$ under the assumed probability model. Also it is important to characterise the case $\beta_L = \beta^*$. Here we will say that $\hat\beta_N$ is consistent without further qualification.

Assume that the true density is $g$.
Then, indicating dependence on the current sample by subscript $N$ and on the labelling within the sample by subscript $t$,

$$\begin{aligned}
\nabla_\beta K_N(\beta) &= -\sum_{t\in T_N} \frac{1}{f_t}\nabla_\eta f_t\, \nabla_\beta \eta_t \\
&= -\sum_{t\in T_N}\Big\{\frac{1}{f_t}\nabla_\eta f_t - E_{g(t,\beta^*)}\Big\{\frac{1}{f_t}\nabla_\eta f_t\Big\}\Big\}\nabla_\beta \eta_t - \sum_{t\in T_N} E_{g(t,\beta^*)}\Big\{\frac{1}{f_t}\nabla_\eta f_t\Big\}\nabla_\beta \eta_t. \qquad (3.1)
\end{aligned}$$

Now, assuming that the law of large numbers can be applied to the first summation in (3.1), this gives (compare (1.1))

$$\frac{1}{N}\nabla_\beta K_N \to -\int_{[0,1]^k} E_{g(t,\beta^*)}\Big\{\frac{1}{f_t}\nabla_\eta f_t\Big\}\nabla_\beta \eta_t\, d\omega(u) \quad \text{a.s.} \qquad (3.2)$$

where $\omega(u)$ is defined in (1.1) and describes the limiting distribution of data events, and $t = t(u)$. Thus the relevant limiting value $\beta_L$ for the minimizers of $K_N$ for finite $N$ solves the system of equations

$$\int_{[0,1]^k} E_{g(t,\beta^*)}\Big\{\frac{1}{f_t}\nabla_\eta f_t\Big\}\nabla_\beta \eta_t\, d\omega(u) = 0. \qquad (3.3)$$

Similar expressions can be found in [10], [20]. Two important cases in which it is possible to deduce from (3.3) that $\beta_L = \beta^*$, the true vector of parameters, are:

(1) Let

$$\nabla_\eta L_t = V_t(\eta)^{-1}\{y_t - \mu_t(\eta)\} \qquad (3.4)$$

where $V_t$ is positive definite. An important class of examples corresponds to $f$ a member of the exponential family of distributions, and in this case $V_t(\eta) = V(\mu_t)$.

(2) Let $f = g$. Then the standard identity

$$E_f\{\nabla_\eta L_t\} = 0 \qquad (3.5)$$

shows that $\beta_L = \beta^*$ provided $\beta^*$ is an isolated solution of (3.3), and we recover the usual result.

To examine conditions under which $\hat\beta_N \to \beta_L$ a.s., assume the smoothness conditions needed for the Newton's method convergence theorem and set $\beta^0 = \beta_L$ in the first step for finding a root of $\frac{1}{N}\nabla K_N(\beta)$.
Then it follows (compare Remark 2.1) that a sufficient condition for

$$\gamma_N \to 0 \ \text{a.s.}, \; N \to \infty \;\Longrightarrow\; \hat\beta_N \to \beta_L \ \text{a.s.}$$

is that

$$\frac{1}{N}\nabla K_N(\beta_L) \to 0, \qquad \frac{1}{N}\mathcal{J}_N(\beta_L) \to \text{bounded, positive definite}, \quad \text{a.s.}, \; N \to \infty.$$

The limit of the first term is (3.3). To analyse the second term note that the strong law now gives

$$\frac{1}{N}\mathcal{J}_N(\beta_L) - \frac{1}{N} E_{g(\beta^*)}\{\mathcal{J}_N(\beta_L)\} \to 0 \quad \text{a.s.} \qquad (3.6)$$

But

$$\begin{aligned}
\frac{1}{N} E_{g(\beta^*)}\{\mathcal{J}_N\} &= \frac{1}{N} E_{f(\beta_L)}\{\mathcal{J}_N\} - \frac{1}{N}\sum_{t\in T_N}\int_{\text{range } y} \big(g(t,\beta^*) - f(t,\beta_L)\big)\nabla^2 L_t\, dy \\
&= \frac{1}{N}\sum_t \nabla_\beta \eta_t^T\, E_{f(t,\beta_L)}\big\{\nabla_\eta L_t(\beta_L)^T \nabla_\eta L_t(\beta_L)\big\}\nabla_\beta \eta_t - \frac{1}{N}\sum_{t\in T_N}\int_{\text{range } y} \big(g(t,\beta^*) - f(t,\beta_L)\big)\nabla^2_\beta L_t(\beta_L)\, dy \qquad (3.7)
\end{aligned}$$

where use has been made of the identity

$$-E\Big\{\frac{\partial^2 L_t}{\partial \eta_i \partial \eta_j}\Big\} = E\Big\{\frac{\partial L_t}{\partial \eta_i}\frac{\partial L_t}{\partial \eta_j}\Big\}. \qquad (3.8)$$

The first term on the right hand side of (3.7) has the limiting value

$$\int_{[0,1]^k} \nabla_\beta \eta_t^T V_t(\beta_L)^{-1} \nabla_\beta \eta_t\, d\omega(u) \qquad (3.9)$$

where

$$V_t(\eta)^{-1} = E_{f(t,\eta)}\{\nabla_\eta L_t(\eta)^T \nabla_\eta L_t(\eta)\}. \qquad (3.10)$$

The second term has the limiting value

$$-\int_{[0,1]^k}\Big\{\int_{\text{range } y} \big(g(t,\beta^*) - f(t,\beta_L)\big)\nabla^2_\beta L_t(\beta_L)\, dy\Big\}\, d\omega(u). \qquad (3.11)$$

It is clear that (3.9) will fail to meet the bounded, positive definite requirement only under rather unusual circumstances, so that $\hat\beta_N$ being a consistent estimate of $\beta_L$ under the hypothesised model will be ensured provided the relative size of the contribution of (3.11) is sufficiently small; that is, provided the averaged contribution of the difference $g(\beta^*) - f(\beta_L)$ is sufficiently small. Of course, if the Hessian is indefinite a.s. for $N$ large enough then the corresponding stationary points cannot be minima and the whole discussion loses its point. In practice, positive definiteness of $\mathcal{J}$ will likely be recognised from the performance of the minimization algorithm (this point is made explicit in the next section).
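A small simulation (the model and distributions are our illustrative assumptions) shows case (1) at work: the fitted density is Gaussian, so the score has the form (3.4) with $V_t = \sigma^2 I$, while the true errors are heavier tailed. The limiting equations (3.3) then reduce to the mean model, so $\beta_L = \beta^*$ and the computed estimates settle down as $N$ grows:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
beta_star = np.array([1.0, 3.0])
mu = lambda beta, t: beta[0] * np.exp(-beta[1] * t)    # signal model (1.3)

for N in (100, 1000, 10000):
    t = rng.uniform(0, 1, N)
    # true density g: Student t errors with 3 d.f. -- not the Gaussian f
    y = mu(beta_star, t) + 0.1 * rng.standard_t(3, N)
    # Gaussian maximum likelihood = nonlinear least squares
    fit = least_squares(lambda b: y - mu(b, t), x0=[0.5, 1.0])
    print(N, fit.x)
```

Misspecifying the density here costs efficiency but not consistency: with the means matched, the difference term (3.11) vanishes for the Gaussian $\nabla^2_\beta L_t$, and (3.9) supplies the positive definite limit.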


In many cases it is not difficult to construct consistent estimators of $\beta^*$. The idea is to copy the form of $K_N$ and prescribe the quasi-likelihood function $\mathcal{W} = \sum_{t\in T} -W_t$ by starting with a gradient specification analogous to (3.4)

$$\nabla_\eta W_t^T = W_t(\eta)^{-1}\big(y_t - \mu(\eta, t)\big) \qquad (3.12)$$

where $W_t(\eta)$ is assumed to be positive definite. Note that this means that it is derivative rather than function values of $\mathcal{W}$ that are immediately available. Provided (1.3) holds, it follows from (3.12) that

$$E_{g(t,\beta^*)}\{\nabla_\eta W_t(\beta^*)\} = 0. \qquad (3.13)$$

Thus, if $\hat\beta_N$ satisfies the estimating equation

$$\nabla_\beta \mathcal{W} = \sum_{t\in T} \nabla_\eta W_t\, \nabla_\beta \eta(\beta, t) = 0 \qquad (3.14)$$

then the preceding argument goes through essentially unchanged to show that $\hat\beta_N \to \beta^*$ a.s., $N \to \infty$, where $\beta^*$ satisfies (1.3), provided the law of large numbers applies to show that

$$\frac{1}{N}\nabla_\beta \mathcal{W}_N(\beta^*) \to 0 \quad \text{a.s.}, \; N \to \infty, \qquad (3.15)$$

and

$$\begin{aligned}
\frac{1}{N}\nabla^2_\beta \mathcal{W}_N &= \frac{1}{N}\sum_t \big\{\nabla_\beta \eta_t^T W_t^{-1} \nabla_\beta \eta_t - (\nabla_\beta \eta_t^T \nabla_\eta W_t^{-1} + \nabla^2_\beta \eta_t^T)(y_t - \mu_t)\big\} \\
&\to \int_{[0,1]^k} \nabla_\beta \eta_t^T W_t^{-1} \nabla_\beta \eta_t\, d\omega(u) = \text{bounded, positive definite.} \qquad (3.16)
\end{aligned}$$

Remark 3.2. It is not difficult to satisfy these conditions for reasonably general distributions of errors. In particular, it is not necessary for $W_t$ to be a consistent estimate of variance. One familiar and important example corresponds to $W_t = I$, the case of (nonlinear) least squares. It follows that this estimating procedure is consistent under typical conditions (adequate smoothness, density with bounded second moments, and sampling distribution compatible with (3.15), (3.16)). It is these latter conditions which contain the real substance of this result, which is due to Jennrich [7].
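Because (3.12) specifies only derivatives, there is no objective function to evaluate; the estimating equation (3.14) can instead be handed directly to a root finder. A sketch (the log-linear count model and working variance are our assumptions for illustration):

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(1)
N, beta_star = 500, np.array([0.3, -0.7])
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
mu_star = np.exp(X @ beta_star)
# true density g: overdispersed (negative binomial) counts whose mean
# satisfies (1.3); the working variance below is W_t = mu_t
y = rng.negative_binomial(n=2, p=2 / (2 + mu_star))

# with mu_t = exp(x_t . beta) and W_t = mu_t, the quasi-score (3.14)
# collapses to X^T (y - mu)
score = lambda b: X.T @ (y - np.exp(X @ b))
print(root(score, x0=np.zeros(2)).x)        # settles near beta_star
```

As Remark 3.2 anticipates, $W_t$ here is not a consistent estimate of the negative binomial variance, yet the procedure remains consistent; the price is some loss of efficiency.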


Remark 3.3. The original suggestion of starting with the estimating equation (3.14) is due to Wedderburn [19]. A detailed discussion is given in [12]. One consequence of the easy access to a consistent estimator is a loss of efficiency. This can be minimised by an appropriate choice of $W_t$, and this question has received some attention. Wedderburn recommended $W_t(\eta) = V_t(\eta)$ as providing a generalisation of maximum likelihood estimation for the exponential family in the case of scalar $y$. Extension to the multivariate case is considered in [11]. Zeger and Liang discuss the use of a consistent estimate of $V_t$ in quasi-likelihood modelling of longitudinal data in several papers (for example [21]).

Remark 3.4. An alternative is to start with $\nabla \mathcal{W}$ satisfying (3.5). This is the basis for a discussion of optimal estimating equations in, for example, Godambe and Heyde [4] and McLeish and Small [13].

4. Fisher's method of scoring

The characteristic feature of the scoring algorithm is its replacement of the exact Hessian $\nabla^2 K$ in the Newton step (2.1) by its expectation. Thus the scoring step has the particular form

$$h = -\mathcal{I}(\beta)^{-1}\nabla K(\beta)^T \qquad (4.1)$$

where

$$\nabla_\beta K(\beta) = \sum_t -\nabla_\eta L_t\, \nabla_\beta \eta_t, \qquad \nabla^2_\beta K(\beta) = \sum_t \big\{-\nabla_\beta \eta_t^T \nabla^2_\eta L_t \nabla_\beta \eta_t - \nabla^2_\beta \eta_t\, \nabla_\eta L_t\big\},$$

and

$$\mathcal{I} = E_f\big\{\nabla^2_\beta K(\beta)\big\} = \sum_t \nabla_\beta \eta_t^T V_t(\eta)^{-1} \nabla_\beta \eta_t, \qquad (4.2)$$

where $V_t(\eta)$ is given by (3.10), and use has been made of the standard identities (3.4) and (3.8).

It follows immediately that:

(1) $\mathcal{I}$ is positive (semi) definite so that $\nabla K h < (=)\, 0$, showing that the scoring step is necessarily downhill for minimizing $K$ when $\mathcal{I}$ is nonsingular, and that the choice of monitor function $Q = K$ is available in (2.4) -- a distinct advantage over Newton's method,
(2) scoring has the same transformation invariance properties as Newton's method, and
(3) scoring requires only first derivative information for its implementation.
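As a concrete instance (a standard exponential-family example; the construction is illustrative rather than the paper's), the scoring step (4.1) for a Poisson log-linear model with $\eta_t = x_t^T\beta$ and $V_t = \exp(\eta_t)$ is:

```python
import numpy as np

def fisher_scoring(X, y, beta, iters=25, tol=1e-10):
    for _ in range(iters):
        mu = np.exp(X @ beta)
        gradK = -X.T @ (y - mu)          # gradient of the negative log likelihood
        I = X.T @ (mu[:, None] * X)      # expected information (4.2)
        h = np.linalg.solve(I, -gradK)   # scoring step (4.1), unit step length
        beta = beta + h
        if np.linalg.norm(h) < tol:
            break
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(400), rng.normal(size=400)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
print(fisher_scoring(X, y, np.zeros(2)))
```

For this canonical model the expected and observed informations coincide, so scoring here reproduces Newton's method exactly; in general the two differ by the terms compared in (3.6)-(3.7).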


On the other hand it is necessary to compute expectations, and this could prove a difficulty because numerical evaluation of the integrals required in (4.2) does not appear practical for continuous distributions in general. However, in many cases the results needed are known exactly (the exponential family of distributions and their generalisations [8] providing an important class of examples).

But the potential problem of computing expectations does provide another reason for considering estimating equations based on the modified score function (3.12). In this case we set

$$\mathcal{I}_s(\beta) = \sum_t \nabla_\beta \eta_t^T W_t^{-1} \nabla_\beta \eta_t \qquad (4.3)$$

so that

$$\mathcal{I}_s(\beta^*) = E_{g(\beta^*)}\big\{\nabla^2_\beta \mathcal{W}(\beta^*)\big\} \qquad (4.4)$$

and define the scoring step by

$$h = -\mathcal{I}_s(\beta)^{-1}\nabla_\beta \mathcal{W}(\beta)^T. \qquad (4.5)$$

Most of the good features of the scoring algorithm recorded above extend to this modified version. However, the line search is complicated by the lack of a monitor function in general. An alternative is to determine the step length parameter $\lambda$ by finding the closest zero of the directional derivative by solving

$$\mathcal{W}'(\beta + \lambda h : h) = 0. \qquad (4.6)$$

This equation may have to be solved fairly accurately to ensure that the stability condition that $\mathcal{W}$ be reduced in the current step is satisfied.
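A sketch of this strategy (implementation ours): the scalar function $\lambda \mapsto \nabla\mathcal{W}(\beta + \lambda h)h$ is driven to zero by the secant method, the tactic used in the numerical example of section 6:

```python
import numpy as np

def secant_step_length(score, beta, h, lam0=0.0, lam1=1.0, iters=20, tol=1e-8):
    # score(beta) returns grad W(beta)^T; g(lam) is the directional
    # derivative W'(beta + lam h : h) of (4.6)
    g = lambda lam: score(beta + lam * h) @ h
    g0, g1 = g(lam0), g(lam1)
    for _ in range(iters):
        if abs(g1) < tol or abs(g1 - g0) < 1e-15:
            break
        lam0, lam1, g0 = lam1, lam1 - g1 * (lam1 - lam0) / (g1 - g0), g1
        g1 = g(lam1)
    return lam1

# demo on a quadratic quasi-score, grad W(b)^T = -(b - 1): from beta = 0
# with h = 0.5 the directional derivative vanishes at lambda = 2
score = lambda b: -(b - 1.0)
print(secant_step_length(score, np.array([0.0]), np.array([0.5])))
```

The bracketing and step-length bookkeeping reported in Table 6.4 ($NB$, $NR$, $XINC$) correspond to exactly this kind of scalar solve.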


The above discussion shows that scoring inherits most of the good properties of Newton's method, and the most serious remaining questions concern its convergence behaviour. Here too there are satisfactory answers. To summarise these, assume that (4.1) is used to generate the scoring step, the monitor function is $Q = K$, the step $\lambda$ is chosen to satisfy (2.4), and successive iterates are confined to a compact region $D \subseteq R^p$ (for example, $D$ could be bounded by a closed equipotential) in which $\mathcal{I}$ has full rank and $\nabla^3 K$ is continuous. Then the standard arguments [16] show that

(1)
$$\frac{\nabla K h}{\|\nabla K\|\|h\|} \leq -1/\mathrm{cond}(\mathcal{I}) \qquad (4.8)$$
(where $\mathrm{cond}(\mathcal{I}) = \|\mathcal{I}\|\|\mathcal{I}^{-1}\|$ is the condition number of $\mathcal{I}$), so that the descent direction given by the scoring algorithm is uniformly downhill for minimizing $K$, and
(2) limit points of the iteration are stationary points of $K$.

Similar results hold for the scoring step based on (4.5) with $Q = \mathcal{W}$ as an (implicitly defined) monitor corresponding to the strategy based on finding a zero of (4.6).

Remark 4.1. Note that $\mathcal{I}$ having full rank in $D$ is not too difficult a global condition. It does not imply that $\mathcal{J}$ has full rank, so it does not suffice to guarantee a single stationary point in $D$. It may suggest superior global convergence properties for the scoring algorithm.

To make further progress it is convenient to assume convergence to a local minimum $\hat\beta$ which is a consistent estimate of the true parameter vector $\beta^*$. This need not be too extreme an assumption in the likelihood case when the objective function can have good convexity properties [18]. The rate of convergence can now be estimated using essentially the same argument as in [16], [17]. For this reason only the estimating equation case is considered here.

The argument goes in two stages. In the first it is shown that $\lambda = 1$ will satisfy the step choice criterion when the current iterate $\beta$ is close enough to $\hat\beta_N$ and $N$ is large enough. In the second stage it is assumed that $\lambda = 1$ is allowable, and the rate of convergence of the full step method is discussed.

In the first stage note that $\lambda$ satisfies (4.6) when $\|h\|$ is small provided

$$\begin{aligned}
0 &= \nabla \mathcal{W}_N(\beta) h + \lambda h^T \nabla^2 \mathcal{W}_N h + O(\|h\|^3) \\
&= h^T\big\{(\lambda - 1)\nabla^2 \mathcal{W}_N + \nabla^2 \mathcal{W}_N - \mathcal{I}_{sN}\big\} h + O(\|h\|^3). \qquad (4.9)
\end{aligned}$$

We assume that

$$\frac{1}{N}\nabla^2 \mathcal{W}_N(\beta) \to \text{bounded, positive definite},$$

while the standard argument invoking the law of large numbers now shows that (almost surely as $N \to \infty$)

$$\frac{1}{N}\big\{\nabla^2 \mathcal{W}_N(\beta) - \mathcal{I}_{sN}(\beta)\big\} = o(1) + O(\|\beta - \beta^*\|)$$

so that the term involving $(\lambda - 1)$ dominates the bracketed term in (4.9). It follows that the minimizing $\lambda$ is close to 1.

Remark 4.2. There is a corresponding result in the likelihood case -- that $\lambda = 1$ can be used eventually when $N$ is large enough.
But now (2.4) provides a criterion which permits us to test if $\lambda = 1$ is acceptable. We do not have access to a corresponding test when values of $\mathcal{W}$ are not available.

In the second stage of the rate of convergence result the iteration (4.5) is written in fixed point form as

$$\beta_{i+1} = F(\beta_i), \qquad F(\beta) = \beta - \mathcal{I}_s(\beta)^{-1}\nabla_\beta \mathcal{W}(\beta)^T. \qquad (4.10)$$

It is a standard result that $\hat\beta_N$ is a point of attraction for this iteration provided the spectral radius of $F'$ (written $\varpi(F')$) satisfies

$$\varpi\big(F'(\hat\beta_N)\big) < 1. \qquad (4.11)$$

Because $\nabla_\beta \mathcal{W}(\hat\beta_N) = 0$ we have (using (4.4) and the familiar argument)

$$\begin{aligned}
F_N'(\hat\beta_N) &= I - \mathcal{I}_{sN}(\hat\beta_N)^{-1}\nabla^2_\beta \mathcal{W}_N(\hat\beta_N) \\
&= \Big(\frac{1}{N}\mathcal{I}_{sN}(\hat\beta_N)\Big)^{-1}\Big(\frac{1}{N}\big\{\mathcal{I}_{sN}(\hat\beta_N) - \nabla^2_\beta \mathcal{W}_N(\hat\beta_N)\big\}\Big) \\
&= F_N'(\beta^*) + O(\|\hat\beta_N - \beta^*\|) \quad \text{a.s.}, \; N \to \infty, \qquad (4.12)
\end{aligned}$$

where

$$F_N'(\beta^*) = o(1) \quad \text{a.s.}, \; N \to \infty. \qquad (4.13)$$

Thus

$$\varpi\big(F_N'(\hat\beta_N)\big) \to 0 \quad \text{a.s.}, \; N \to \infty. \qquad (4.14)$$

This shows that the rate of convergence of the scoring algorithm increases as the effective sample size $N$ increases. The consequence is that scoring is an attractive algorithm, in the sense that it has a fast rate of ultimate convergence, provided the measure $N$ of the effective sample size is large enough.

Remark 4.3. When the choice $f \neq g$ is made in the definition of $K$ then the scoring algorithm uses as Hessian estimate

$$\mathcal{I}(\beta) = E_{f(\beta)}\{\mathcal{J}(\beta)\}. \qquad (4.15)$$

Arguing as in the derivation of (3.6), (3.7), the effect of the misidentification is to give the almost sure limiting value

$$F_\infty'(\beta_L) = \Big(\int_{[0,1]^k}\nabla_\beta \eta_t^T V_t(\beta_L)^{-1}\nabla_\beta \eta_t\, d\omega(u)\Big)^{-1}\Big(\int_{[0,1]^k}\Big\{\int_{\text{range } y}\big(g(t,\beta^*) - f(t,\beta_L)\big)\nabla^2 L_t(\beta_L)\, dy\Big\}\, d\omega(u)\Big). \qquad (4.16)$$
Comparing (4.16) with (3.6), (3.7), it will be seen that the condition for $\beta_L$ to be an attractive fixed point for the scoring algorithm implies the condition for $\mathcal{J}(\beta_L)$, the sample information, to be positive definite. This follows from the result that if $A$ is positive definite and $B$ is symmetric then

$$\varpi\{A^{-1}B\} < 1 \Rightarrow A - B \ \text{positive definite}.$$

Thus there is a very close connection between the ultimate convergence of the scoring algorithm with a unit step in the line search, and the condition for $\hat\beta_N$ to be a consistent estimator for $\beta_L$. It follows from (4.16) that misidentification of the density can explain poor performance of the scoring algorithm for large $N$, in particular when $\hat\beta_N \to \beta_L \neq \beta^*$ a.s., $N \to \infty$.

Remark 4.4. An alternative way to look at the above results is that if the scoring algorithm converges rapidly then there is direct evidence that the problem modelling is adequate. It is known that $\varpi(F'(\beta^*))$ is an invariant characterising the local nonlinearity of the likelihood surface (see, for example, [9]). If it is small then confidence intervals based on linear theory will be adequate for the parameter estimates. Clearly knowledge of $\varpi(F'(\beta^*))$ is of value, and this can frequently be estimated from the ratio $\|h_{i+1}\|/\|h_i\|$ in the case that the largest eigenvalue in modulus of $F'(\beta^*)$ is isolated. This just corresponds to an application of the classical power method for determining eigenvalues and will work best when the convergence rate is slow -- but in the contrary case it will return a small number as required.

Remark 4.5. Both the forms (4.1) and (4.5) of the scoring algorithm possess the property that the calculation of the scoring step can be cast in the form of a linear least squares problem. This point is useful in implementation. Consider (4.5), define

$$A_t = \nabla_\beta \eta_t^T W_t^{-\frac12}, \qquad (4.17)$$

and set

$$A = \begin{bmatrix} A_1^T \\ \vdots \\ A_{|T|}^T \end{bmatrix}, \qquad b = \begin{bmatrix} (W_1^{-\frac12})^T(y_1 - \mu_1) \\ \vdots \\ (W_{|T|}^{-\frac12})^T(y_{|T|} - \mu_{|T|}) \end{bmatrix}. \qquad (4.18)$$

Then (4.5) can be written

$$\min_h r^T r, \qquad r = Ah - b. \qquad (4.19)$$

Note that this system has a standardised right hand side at $\beta = \beta^*$ if $W_t$ is the correct covariance matrix. This suggests that this formulation corresponds to an efficacious scaling.
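A sketch of the assembly (code ours; the Cholesky factor of $W_t^{-1}$ serves as one admissible square root):

```python
import numpy as np

def scoring_step_lstsq(jac_eta, W_list, y_list, mu_list):
    # jac_eta[t] is the q x p Jacobian of eta_t with respect to beta;
    # W_list[t] is the q x q weight matrix of (3.12)
    rows, rhs = [], []
    for Jt, Wt, yt, mut in zip(jac_eta, W_list, y_list, mu_list):
        L = np.linalg.cholesky(np.linalg.inv(Wt))  # L L^T = W_t^{-1}
        rows.append(L.T @ Jt)                      # block row of A, eq. (4.18)
        rhs.append(L.T @ (yt - mut))               # block of b
    A, b = np.vstack(rows), np.concatenate(rhs)
    h, *_ = np.linalg.lstsq(A, b, rcond=None)      # min ||A h - b||, eq. (4.19)
    return h
```

Solving (4.19) directly by a least squares routine avoids forming $\mathcal{I}_s = A^T A$ explicitly, which is the source of the scaling and stability advantages mentioned in Remark 1.1.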


5. Scoring under constraints

Discussion of general constrained problems is not attempted. But, as a general rule, local behaviour can be predicted on the basis of quadratic approximation of the objective function and linear approximation of the active constraints; and linear constraints of one form or another are never too far away from the class of problems we are considering. The point to be made here is that scoring usually maintains its good properties when applied to constrained problems.

Examples of constraints include:

(1) constraints defining discrete probability distributions. For example:
$$\pi_i \geq 0, \; i = 1, 2, \dots, m, \qquad \sum_{i=1}^m \pi_i = 1; \qquad (5.1)$$

(2) constraints imposed to ensure identifiability in analysis of variance calculations. For example, let
$$g(\mu_{ij}) = m + a_i + b_j, \quad i = 1, 2, \dots, n_a, \; j = 1, 2, \dots, n_b; \qquad (5.2)$$
then typically the conditions
$$\sum_{i=1}^{n_a} a_i = 0, \qquad \sum_{j=1}^{n_b} b_j = 0 \qquad (5.3)$$
are imposed to remove the obvious ambiguities;

(3) constraints which express additional information about the problem. For example, information that some parameters must be non-negative could be the expression of some physical law. Constraints of this kind are called bound constraints;

(4) sometimes it is convenient to fix certain components of the parameter vector $\beta$ in exploratory calculations involving controlling and setting parameter values in testing a range of models. Scoring may not be too robust with respect to this kind of activity, which could introduce inconsistency if the complexity is understated, and exaggerate any tendency towards ill-conditioning in $\mathcal{I}$ if it is overstated.

Frequently problems of the first kind are eliminated by introducing a suitable parametrization (this point is illustrated in the next section where we consider an application involving a multinomial likelihood). Linear equality constraints (for example (5.3)) can be used to eliminate variables, but this can cause fill-in in sparse matrix structures such as (5.2) and so may not be an optimal strategy.


To discuss the effect of linear equality constraints on the performance of the scoring algorithm, consider the constrained maximum likelihood problem (there is a corresponding quasi-likelihood development)

$$\min_\beta K_N(\beta) \quad \text{subject to} \quad C\beta = d. \qquad (5.4)$$

The necessary conditions for a minimum give the equations

$$\nabla_\beta K_N(\beta) = z^T C, \qquad C\beta = d \qquad (5.5)$$

where $z$ is the vector of multipliers. Setting $\zeta = z/N$, the scoring step is given by

$$\begin{bmatrix} \beta \\ \zeta \end{bmatrix} \leftarrow \begin{bmatrix} \beta \\ \zeta \end{bmatrix} + \begin{bmatrix} h_\beta \\ h_\zeta \end{bmatrix}$$

where the descent direction satisfies the system

$$\mathcal{I}_N h_\beta - N C^T h_\zeta = -\nabla K_N(\beta)^T + N C^T \zeta, \qquad (5.6a)$$
$$-N C h_\beta = N(C\beta - d). \qquad (5.6b)$$

To discuss the rate of convergence note that the appropriate limiting system corresponding to the $N \to \infty$ limit in (5.5) is

$$\int_{[0,1]^k} E_{g(t,\beta^*)}\{\nabla_\beta L_t\}\, d\omega(u) = \zeta^{*T} C, \qquad C\beta^* = d. \qquad (5.7)$$

Arguing as in (4.10)-(4.14), the rate of convergence depends on

$$\varpi\left(\begin{bmatrix} \frac{1}{N}\mathcal{I}_N & -C^T \\ -C & 0 \end{bmatrix}^{-1}\begin{bmatrix} \frac{1}{N}(\mathcal{J}_N - \mathcal{I}_N) & 0 \\ 0 & 0 \end{bmatrix}\right).$$

It gets small as $N \to \infty$ under similar conditions. Now consider the factorization

$$C^T = Q\begin{bmatrix} U \\ 0 \end{bmatrix} \qquad (5.8)$$

where $Q = [\,Q_1 \; Q_2\,]$ is orthogonal and $U$ is upper triangular. Substituting in (5.6) gives, in particular,

$$Q_2^T \mathcal{I}_N Q_2\, h_c = -Q_2^T \nabla_\beta K_N^T \qquad (5.9)$$

provided $\beta$ satisfies the equality constraints, where

$$h_c = Q_2^T h_\beta \quad (h_\beta = Q_2 h_c), \qquad (5.10)$$

and this can readily be put into the form (4.19).
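In code the reduction is a few lines (sketch ours; the current $\beta$ is assumed to satisfy $C\beta = d$, so the right hand side of (5.6b) vanishes):

```python
import numpy as np

def constrained_scoring_step(I_N, gradK, C):
    # gradK is the column vector grad K_N(beta)^T; C has full row rank r
    r = C.shape[0]
    Q, _ = np.linalg.qr(C.T, mode='complete')   # C^T = Q [U; 0], eq. (5.8)
    Q2 = Q[:, r:]                               # orthonormal null-space basis of C
    hc = np.linalg.solve(Q2.T @ I_N @ Q2, -Q2.T @ gradK)   # eq. (5.9)
    return Q2 @ hc                              # h_beta of (5.10); C h_beta = 0
```

Because $Q_2$ has orthonormal columns, the reduced information $Q_2^T\mathcal{I}_NQ_2$ inherits positive (semi) definiteness, and the computed step automatically stays on the constraint manifold.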


Again, a similar development holds for quasi-likelihoods.

Similar results can be obtained for bound constraints. Major changes are not needed in the consistency arguments (they are based on the distribution of $y$ rather than of $\beta$), although an extension of the Newton result is needed and second order sufficiency becomes the right property in showing $\beta^*$ is isolated. A discussion of consistency for bound constraints under much weaker conditions is given in Moran [14]. He derives the distribution of $\hat\beta$ in Moran [15].

There is no real restriction in assuming lower bounds only, and in this case the maximum likelihood problem becomes

$$\min_{\beta \in B} K(\beta), \qquad B = \{\beta; \beta \geq b\}. \qquad (5.11)$$

The Kuhn-Tucker conditions characterizing the optimum are

$$\nabla K(\beta) = u^T, \qquad (5.12a)$$
$$u \geq 0, \qquad u_i(\hat\beta_i - b_i) = 0, \quad i = 1, 2, \dots, p. \qquad (5.12b)$$

Let $\sigma(\beta) = \{i; \beta_i > b_i\}$ be an index set pointing to the variables not at their bounds, and let $P = \begin{bmatrix} H_\sigma \\ K_\sigma \end{bmatrix}$ be a permutation matrix with the property that

$$H_\sigma \beta = \beta_\sigma, \qquad K_\sigma \beta = \beta_{\sigma^c} \qquad (5.13)$$

separates $\beta$ into its free and fixed components. Then the scoring step constrained by the condition $K_\sigma h = 0$ is given by the solution of the system of equations (compare (5.6))

$$\mathcal{I} h + \nabla K^T - K_\sigma^T \lambda = 0, \qquad (5.14a)$$
$$K_\sigma h = 0, \qquad (5.14b)$$

and this can be put in the equivalent form

$$h = -H_\sigma^T (H_\sigma \mathcal{I} H_\sigma^T)^{-1} H_\sigma \nabla K^T. \qquad (5.15)$$

Here $\lambda$ can be interpreted as a set of tentative Kuhn-Tucker multipliers, and optimality corresponds to

$$\nabla K^T = K_\sigma^T \lambda, \qquad \lambda \geq 0. \qquad (5.16)$$


The algorithm could proceed as follows (a sketch of this loop in code is given after the list):

(1) Compute $h$, $\lambda$ at the current point.
(2) If $\lambda_q < 0$, where $q = \arg\min_i \lambda_i$, then $\sigma \leftarrow \sigma \cup \{q\}$ and recompute $h$ (it can be shown that this defines a feasible direction); else stop if $\|H_\sigma \nabla K^T\|$ is small enough.
(3) Let $\lambda_1$ satisfy the step criterion (2.4); let $q = \arg\min_i \{\lambda; \beta_i + \lambda h_i = 0, \lambda > 0, i \in \sigma\}$ and $\lambda_2 = -\beta_q/h_q$. If $\lambda_2 < \lambda_1$ then $\lambda \leftarrow \lambda_2$, $\sigma \leftarrow \sigma \setminus \{q\}$; else $\lambda \leftarrow \lambda_1$. Set $\beta \leftarrow \beta + \lambda h$ and repeat from (1).
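The loop might be realised as follows (sketch ours: the step test (2.4) is simplified to an Armijo backtrack on $K$, general lower bounds $b$ are allowed, and the iteration cap is arbitrary):

```python
import numpy as np

def scoring_lower_bounds(K, gradK, info, beta, b, tol=1e-8):
    free = beta > b                        # sigma(beta): variables off bound
    for _ in range(200):
        g = gradK(beta)
        f, a = np.flatnonzero(free), np.flatnonzero(~free)
        h = np.zeros_like(beta)
        if f.size:                         # step on free variables, eq. (5.15)
            h[f] = np.linalg.solve(info(beta)[np.ix_(f, f)], -g[f])
        # tentative multipliers are gradient components at the bounds
        if a.size and g[a].min() < 0:
            free[a[g[a].argmin()]] = True  # release the worst bound, recompute h
            continue
        if np.linalg.norm(g[f]) < tol:
            return beta                    # Kuhn-Tucker conditions (5.12) hold
        lam1 = 1.0                         # lambda_1: simplified step criterion
        while K(beta + lam1 * h) > K(beta) + 1e-4 * lam1 * (g @ h):
            lam1 /= 2.0
        hit = f[h[f] < 0]                  # free variables moving toward a bound
        ratios = (b[hit] - beta[hit]) / h[hit] if hit.size else np.array([])
        lam2 = ratios.min() if hit.size else np.inf
        if lam2 < lam1:                    # a bound becomes active
            beta = beta + lam2 * h
            q = hit[ratios.argmin()]
            beta[q], free[q] = b[q], False # land exactly on the bound
        else:
            beta = beta + lam1 * h
    return beta
```

Lemma 5.1 below shows that, started close enough to $\hat\beta$, a loop of this kind adds active constraints one at a time until the correct active set is located.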


To analyse the rate of convergence of the algorithm assume that $\hat\beta \in S \subseteq B$ where $S$ possesses the properties:

(1) $S$ is small, so that $\beta \in S \Rightarrow \|\beta - \hat\beta\| \leq \epsilon$.
(2) $S$ is bounded by an equipotential, $S = \{\beta; K(\beta) \leq \widetilde K\} \cap B$.
(3) $S$ intersects only constraints active at $\hat\beta$, so that $(\beta_i = b_i, \beta \in S) \Rightarrow \hat\beta_i = b_i$.
(4) A form of strict complementarity implying a condition on $S$:
$$\hat\beta_i = b_i \Rightarrow (\nabla K(\beta))_i > 0, \quad \forall \beta \in S. \qquad (5.17)$$

It follows from (3), (4) that $H_\sigma \nabla K(\beta)^T = 0 \Rightarrow \beta = \hat\beta$, because $\beta \in S \Rightarrow \sigma(\beta) \supseteq \sigma(\hat\beta)$.

Lemma 5.1. If $\sigma(\beta) \setminus \sigma(\hat\beta) \neq \emptyset$ then $h$ is downhill for minimizing $K$, and the minimum with respect to $\lambda$ in $S$ of $K(\beta + \lambda h)$ is attained at a point where at least one more component of $\beta$ is at a bound.

Proof. From the above assumptions

$$H_\sigma \nabla K(\beta)^T = H_\sigma u + O(\epsilon) = O(1), \quad \beta \in S, \; \sigma(\beta) \setminus \sigma(\hat\beta) \neq \emptyset,$$

so that

$$\nabla K(\beta + \lambda h)\, h = -u^T H_\sigma^T (H_\sigma \mathcal{I} H_\sigma^T)^{-1} H_\sigma u + O(\epsilon) < 0, \quad \lambda \leq \lambda_2. \qquad (5.18)$$

It follows from assumption (2) above that $K$ is minimised in $S$ for $\lambda = \lambda_2$, so that the constraint $\beta_q = b_q$ becomes active.

Corollary 5.1. If the scoring algorithm is started from a point $\beta \in S$ then the correct active set is located in a number of steps bounded by the number of constraints active at $\hat\beta$.

This result is reported in [17]. A similar result has been given in Burke and Moré [3].

Remark 5.1. It now follows from assumption (4) that constraints are never deleted once $\beta \in S$. The constraint set is built up to $\sigma(\hat\beta)$ in a finite number of steps and the iteration then proceeds as an unconstrained problem in the reduced variables $z = H_{\sigma(\hat\beta)}\beta$. The rate of convergence of the scoring method is now given by

$$\varpi(F') = \varpi\big((H_\sigma \mathcal{I} H_\sigma^T)^{-1}(H_\sigma(\mathcal{I} - \mathcal{J})H_\sigma^T)\big).$$

This quantity is small under the same conditions that obtain in the unconstrained case.

6. Numerical Examples

Here numerical results illustrating a number of the points made in previous sections are given. Two problems are considered. The first is an example of a multinomial likelihood and illustrates a successful application of the scoring method. The second example is a quasi-likelihood calculation taken from Wedderburn's original paper [19]. Here the scoring method is only moderately successful and gives rise to some doubts about the modelling of the problem. Both examples involve constraints of the types previously considered. In the likelihood problem they are removed by reparametrization. In the quasi-likelihood problem, which is solved iteratively, it is convenient to use the transformed subproblem (5.9).

The multinomial likelihood. This distribution occurs when a sequence of $n$ independent and identical experiments is performed. If each experiment results in one of $m$ possible outcomes with respective probabilities $\pi_i \geq 0$, $i = 1, 2, \dots, m$, $\sum_{i=1}^m \pi_i = 1$, then

$$P\{Y_i = y_i, \; i = 1, 2, \dots, m\} = \frac{n!}{y_1!\cdots y_m!}\,\pi_1^{y_1}\cdots\pi_m^{y_m} \qquad (6.1)$$
where $y_i$ is the number of times that the $i$'th outcome occurred and $\sum_{i=1}^m y_i = n$. Consider now the case in which the multinomial distribution governs the outcome of $n(t)$ trials for each $t$, $t \in T$. Here the $y_i(t)$ are observed and the quantities to be modelled are the $\pi_i$. Assuming that the outcomes of each group of trials are mutually independent, the log likelihood is additive, and the contribution from each group of trials is given by

$$L_t = \sum_{i=1}^m y_i(t)\log(\pi_i(t)) + \text{const.}, \qquad (6.2)$$

and this can be put in the form (1.2) by setting

$$(\eta_t)_i = \pi_i(\beta, t), \; i = 1, 2, \dots, m-1, \qquad \pi_m(\beta, t) = 1 - e^T \eta_t. \qquad (6.3)$$

The reduction of a scoring step to a linear least squares problem is straightforward in this case. Differentiating (6.2) and taking expectations where appropriate gives

$$\frac{\partial L_t}{\partial \eta_i} = \frac{y_i(t)}{(\eta_t)_i} - \frac{y_m(t)}{1 - e^T\eta_t} = \frac{y_i(t)}{\pi_i(t)} - \frac{y_m(t)}{\pi_m(t)}, \quad i = 1, 2, \dots, m-1, \qquad (6.4)$$

$$\frac{\partial^2 L_t}{\partial \eta_i \partial \eta_j} = -\frac{y_i(t)}{\pi_i(t)^2}\delta_{ij} - \frac{y_m(t)}{\pi_m(t)^2}, \quad i, j = 1, 2, \dots, m-1, \qquad (6.5)$$

$$E\Big\{\frac{\partial^2 L_t}{\partial \eta_i \partial \eta_j}\Big\} = -\frac{n(t)}{\pi_i(t)}\delta_{ij} - \frac{n(t)}{\pi_m(t)}, \quad i, j = 1, 2, \dots, m-1. \qquad (6.6)$$

Thus $V_t^{-1}$ in (3.10) is given by

$$V_t^{-1} = n(t)\big\{\mathrm{diag}(1/\pi_i(t)) + (1/\pi_m(t))ee^T\big\} = n(t)\, D_t^{\frac12}\big(I + (1/\pi_m(t))vv^T\big)D_t^{\frac12} \qquad (6.7)$$

where

$$D_t = \mathrm{diag}(1/\pi_i(t), \; i = 1, \dots, m-1), \qquad v_t = D_t^{-\frac12}e. \qquad (6.8)$$

Then one form for $V_t^{-\frac12}$ is

$$V_t^{-\frac12} = n(t)^{\frac12}D_t^{\frac12}(I + \alpha_t v_t v_t^T) \qquad (6.9)$$

where

$$\alpha_t = \frac{1}{\pi_m(t) + \pi_m(t)^{\frac12}}.$$

The quantities defining the linear least squares form of the scoring step (4.19) can now be evaluated. This gives

$$A_t = n(t)^{\frac12}\big(\nabla_\beta \eta_t^T D_t^{\frac12} + \alpha_t(\nabla_\beta \eta_t^T e)v_t^T\big) = n(t)^{\frac12}\big(\nabla_\beta \eta_t^T D_t^{\frac12} - \alpha_t \nabla_\beta \pi_m(t)^T v_t^T\big), \qquad (6.10)$$

and, noting that

$$(I + \alpha_t v_t v_t^T)^{-1}\big(D_t^{\frac12}y(t) - (y_m(t)/\pi_m(t))v_t\big) = D_t^{\frac12}y(t) - \alpha_t\big(y_m(t) + n(t)\pi_m(t)^{\frac12}\big)v_t,$$

$$b_t = n(t)^{\frac12}D_t^{-\frac12}\big(D_t y(t)/n(t) - \alpha_t(y_m(t)/n(t) + \pi_m(t)^{\frac12})e\big), \qquad (6.11)$$

where $y(t)$ is the vector with components $(y_1(t), \dots, y_{m-1}(t))$. In this case the components of $b_t$ have a scale of $\sqrt{n(t)\pi_i(t)} = \sqrt{E\{y_i(t)\}}$, while the untransformed quantities $\partial L_t/\partial \eta_i$ have a corresponding scale of $n(t)$.
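The rank-one square-root algebra is easy to confirm numerically (check ours, for a random probability vector with $m = 4$): with $D = \mathrm{diag}(1/\pi_i)$, $v = D^{-1/2}e$ and $\alpha = 1/(\pi_m + \pi_m^{1/2})$, the factor $M = D^{1/2}(I + \alpha vv^T)$ of (6.9) satisfies $MM^T = D + ee^T/\pi_m$, which is $V^{-1}/n$ by (6.7):

```python
import numpy as np

rng = np.random.default_rng(3)
pi = rng.dirichlet(np.ones(4))              # m = 4 cell probabilities
pim, p = pi[-1], pi[:-1]
D = np.diag(1 / p)
e = np.ones(3)
v = np.sqrt(p)                              # v = D^{-1/2} e, eq. (6.8)
alpha = 1 / (pim + np.sqrt(pim))            # eq. (6.9)
M = np.sqrt(D) @ (np.eye(3) + alpha * np.outer(v, v))
print(np.allclose(M @ M.T, D + np.outer(e, e) / pim))   # True
```

Substituting this factor gives $A_t$ and $b_t$ of (6.10)-(6.11), so each multinomial observation contributes $m - 1$ well-scaled rows to the least squares problem (4.19).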


Remark 6.1. Here the sampling strategy must combine aspects of the two following possibilities, both of which can be relevant to the convergence analysis:

(1) Each $n(t) \to \infty$, $t \in T$, uniformly in the sense that $n(t_i)/n(t_j)$ is bounded for all $t_i, t_j \in T$, but $|T|$ is fixed (it needs to be sufficiently large for the problem to be identifiable for $\beta$). Then $\varpi\big(\mathcal{I}^{-1}(\mathcal{I} - \mathcal{J})\big) \to 0$ provided
$$y_i(t)/n(t) \to \pi_i(t) > 0, \quad i = 1, 2, \dots, m, \; t \in T.$$
(2) Each $n(t)$ is limited in size by a condition of the form $n(t)/N \leq K/|T|$, but now $|T| \to \infty$.

An example of trinomial data ($m = 3$) is given in Table 6.1. It derives from a study of the effects of a cattle virus on chicken embryos.

Table 6.1 near here

The model fitted to this data is

$$\begin{aligned}
\pi_1 &= 1/(1 + \exp(-\beta_1 - \beta_3 \log(t))), \\
1 - \pi_3 &= 1/(1 + \exp(-\beta_2 - \beta_3 \log(t))), \\
\pi_2 &= 1 - \pi_1 - \pi_3. \qquad (6.12)
\end{aligned}$$

Starting values were provided with the data. Results obtained using the scoring algorithm implemented in the manner described in [16] are given in Table 6.2.

Table 6.2 near here


It will be seen that the performance of the algorithm is very satisfactory. It is stressed that the fast rate of convergence achieved provides persuasive evidence also that the modelling strategy is satisfactory.

The direct application of scoring to multinomial data is discussed in Green [5]. The above presentation gives $A$ and $b$ explicitly in (4.19). The resulting algorithm appears to be superior to the procedure described in [12] for use with the statistical package GLIM. A recent applications oriented account of the use of this package is given in [1].

Quasi-likelihood calculation. The quasi-likelihood calculation reported here follows the problem analysis presented in [19]. In Table 6.3 the percentage of leaf blotch infection (R. secalis) is given for 10 varieties of barley grown on 9 sites.

Table 6.3 near here

We let $y_{ij}$ be the proportion of leaf blotch at site $i$ for variety $j$. The quasi-likelihood calculation is defined by setting $\mu_{ij} = E\{y_{ij}\}$, $i = 1, 2, \dots, n_a$, $j = 1, 2, \dots, n_b$, and positing the model

$$\mathrm{logit}(\mu_{ij}) = \log\Big(\frac{\mu_{ij}}{1 - \mu_{ij}}\Big) = m + a_i + b_j, \qquad (6.13a)$$
$$\mathrm{var}(y_{ij}) = \mu_{ij}^2(1 - \mu_{ij})^2. \qquad (6.13b)$$

To use the mean model (6.13a) it is necessary to add identifiability constraints as in (5.3). The reduction of the linear subproblem to (5.9), (5.10), and hence to (4.19), is straightforward in this case as the rows of the constraint matrix $C$ are orthogonal. A further simplification occurs because of the choice of model. Let

$$w_{ij} = m + a_i + b_j;$$

then

$$\mu_{ij} = \frac{e^{w_{ij}}}{1 + e^{w_{ij}}}$$

and the $ij$'th row of $\widetilde A$, where $A = \widetilde A Q_2$, is given by

$$\widetilde a_{ij} = \frac{1}{\mathrm{var}(y_{ij})^{\frac12}}\frac{d\mu_{ij}}{dw_{ij}}\nabla_\beta w_{ij} = \nabla_\beta w_{ij}.$$

This corresponds to a balanced design, so that $h_\beta$ can be written down immediately.
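A sketch of the resulting iteration (assembly ours; the constraints (5.3) are eliminated here by writing the last $a$ and $b$ as the negative sum of the others instead of through the $Q_2$ reduction, and a unit step replaces the secant search of (4.6)): because $\widetilde a_{ij} = \nabla_\beta w_{ij}$, each scoring step is an ordinary least squares regression of the scaled residual $(y_{ij} - \mu_{ij})/(\mu_{ij}(1 - \mu_{ij}))$ on the design of (6.13a):

```python
import numpy as np

def design(na, nb):
    # columns: overall mean m, a_1..a_{na-1}, b_1..b_{nb-1}; the last level
    # of each factor is minus the sum of the others, enforcing (5.3)
    rows = []
    for i in range(na):
        for j in range(nb):
            xa = -np.ones(na - 1) if i == na - 1 else np.eye(na - 1)[i]
            xb = -np.ones(nb - 1) if j == nb - 1 else np.eye(nb - 1)[j]
            rows.append(np.concatenate([[1.0], xa, xb]))
    return np.array(rows)

def fit_quasi_logit(y, na, nb, iters=30, tol=1e-8):
    # y: proportions flattened site by site, as in Table 6.3 (divided by 100)
    X, theta = design(na, nb), np.zeros(1 + (na - 1) + (nb - 1))
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ theta))
        r = (y - mu) / (mu * (1 - mu))           # scaled residual
        h, *_ = np.linalg.lstsq(X, r, rcond=None)
        theta = theta + h                        # unit step length
        if np.linalg.norm(h) < tol:
            break
    return theta
```

With the unit step, the iteration can be expected to contract only geometrically, at roughly the rate recorded in the $RAT$ column of Table 6.4 -- the behaviour discussed in Remark 4.4.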


In the numerical results reported below, the secant algorithm was used to find an approximate zero of (4.6), and a standard linear least squares fit to the data was used to set initial values. The behaviour of the iteration is detailed in Table 6.4, which gives, for each iteration, $DX = \|h\|$, $DF = \mathcal{W}'(\beta : h)$, the number of steps needed to bracket the minimum ($NB$), the number of secant algorithm steps ($NR$), and the computed step $XINC$. The convergence test for terminating the algorithm is based on the size of $DX$. The test on the linesearch is less extreme and takes account both of the size of $DX$ and the convergence of $XINC$.

Table 6.4 about here

The secant algorithm is quite well behaved, with the root being bracketed quickly and the choice of a unit scale being clearly satisfactory. Equally clearly the calculation could be economized (taking $\lambda = 1$ in the line search step, except possibly in the first iteration, would almost certainly be satisfactory). However, the overall rate of convergence is not very fast, and with the scoring algorithm this is a good indication of shortcomings in the modelling procedure. In this case the power method gives a useful indication of the limiting value of $\varpi$. The successive ratios are given in the final column of Table 6.4. The residuals in the fit are displayed in Table 6.5.

Table 6.5 about here

Acknowledgement

I am indebted to Dorothy Anderson for helpful discussion on the application of scoring methods, and, in particular, for the data for the trinomial example. Also to Terry Speed for a careful reading of the paper and many constructive suggestions regarding presentation.

Résumé. We analyse Fisher's method of scoring for maximizing likelihoods and for solving equations based on quasi-likelihoods. We show that consistent estimation of the parameter vector is necessary to obtain fast convergence, but if this condition is satisfied the algorithm is very effective. This connection between the performance of the scoring algorithm and the adequacy of the modelling of the problem is stressed. We discuss the effect of linear constraints, and we present examples of likelihood and quasi-likelihood calculations.

References

1. M. Aitkin, Dorothy Anderson, B. Francis and J. Hinde, Statistical Modelling in GLIM, Oxford Statistical Science Series, vol. 4, Clarendon Press, 1989.
2. U. Ascher and M.R. Osborne, A note on solving nonlinear equations and the natural criterion function, J.O.T.A. 55 (1988), 147-152.
3. J.V. Burke and J.J. Moré, On the identification of active constraints, Tech. Mem. no. 82, Mathematics and Computer Science Division, Argonne National Laboratory.
4. V.P. Godambe and C.C. Heyde, Quasi-likelihood and optimal estimation, Int. Stat. Rev. 55 (1987), 231-244.
5. P.J. Green, Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives, J.R.S.S. B 46 (1984), 149-192.
6. P.J. Huber, The behaviour of maximum likelihood estimators under nonstandard conditions, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1 (1967), University of California Press, Berkeley.
7. R.I. Jennrich, Asymptotic properties of nonlinear least squares estimators, Ann. Math. Stat. 40 (1969), 633-643.
8. B. Jørgensen, Exponential dispersion models, J.R. Statist. Soc. B 49 (1987), 127-162.
9. R.E. Kass and G.K. Smyth, The rate of convergence of Fisher scoring: a geometric interpretation, Technical Report, Department of Mathematics (1989), University of Queensland.
10. J.T. Kent, Robust properties of likelihood ratio tests, Biometrika 69 (1982), 19-27.
11. P. McCullagh, Quasi-likelihood functions, Ann. Stat. 11 (1983), 59-67.
12. P. McCullagh and J.A. Nelder, Generalised Linear Models, 2nd edition, Chapman and Hall, 1989.
13. D.L. McLeish and C.G. Small, The Theory and Application of Statistical Inference Functions, Lecture Notes in Statistics 44, Springer Verlag, 1988.
14. P.A.P. Moran, The uniform consistency of maximum likelihood estimators, Proc. Cam. Phil. Soc. 70 (1971), 435-439.
15. P.A.P. Moran, Maximum-likelihood estimation in non-standard conditions, Proc. Cam. Phil. Soc. 70 (1971), 441-450.
16. M.R. Osborne, Estimating nonlinear models by maximum likelihood for the exponential family, S.I.S.S.C. 8 (1987), 446-456.
17. M.R. Osborne, Notes on the method of scoring, DMS report ACT 87/20 (1987), C.S.I.R.O.
18. J.W. Pratt, Concavity of the log likelihood, J. Amer. Stat. Assoc. 76 (1981), 103-106.
19. R.W.M. Wedderburn, Quasi-likelihood functions, generalised linear models, and the Gauss-Newton method, Biometrika 61 (1974), 439-447.
20. H. White, A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity, Econometrica 48 (1980), 817-838.
21. S.L. Zeger, Kung-Yee Liang, and P.S. Albert, Models for longitudinal data: a generalised estimating equation approach, Biometrics 44 (1988), 1049-1060.


log10(titre)   dead   deformed   normal
   -0.42         0        0        18
    0.58         1        2        13
    1.58         5        6         4
    2.58        12        6         1
    3.58        18        1         0
    4.58        16        0         0

Table 6.1: Cattle virus data

iteration   K(beta)   -grad(K)h   beta_1    beta_2    beta_3
    0        49.31                -4.597    -3.145    .7405
    1        47.53    .1353+2     -3.937    -2.369    .7834
    2        47.00    .9778+0     -4.419    -2.584    .8882
    3        46.99    .2145-1     -4.405    -2.620    .9060
    4        46.99    .7370-5     -4.405    -2.619    .9060
    5        46.99    .8861-8     -4.405    -2.619    .9060

Table 6.2: Results of computation with trinomial data
(entries in the -grad(K)h column are written mantissa+exponent, e.g. .1353+2 = .1353 x 10^2)

         (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)
(1)     0.05    0.00    0.00    0.10    0.25    0.05    0.50    1.30    1.50    1.50
(2)     0.00    0.05    0.05    0.30    0.75    0.30    3.00    7.50    1.00   12.70
(3)     1.25    1.25    2.50   16.60    2.50    2.50    0.00   20.00   37.50   26.25
(4)     2.50    0.50    0.01    3.00    2.50    0.01   25.00   55.00    5.00   40.00
(5)     5.50    1.00    6.00    1.10    2.50    8.00   16.50   29.50   20.00   43.50
(6)     1.00    5.00    5.00    5.00    5.00    5.00   10.00    5.00   50.00   75.00
(7)     5.00    0.10    5.00    5.00   50.00   10.00   50.00   25.00   50.00   75.00
(8)     5.00   10.00    5.00    5.00   25.00   75.00   50.00   75.00   75.00   75.00
(9)    17.50   25.00   42.50   50.00   37.50   95.00   62.50   95.00   95.00   95.00

Table 6.3: Incidence of R. secalis on leaves of 10 varieties (columns) grown at 9 sites (rows): percentage leaf area affected


iteration     DX        DF       NB   NR   XINC    RAT
    1       2.93+0   -2.39+2     2    12   1.754
    2       2.13+0   -4.55+1     2    11   1.335   .727
    3       7.83-1   -1.37+1     2    11   1.369   .367
    4       5.36-1   -2.82+0     2     8   1.089   .684
    5       1.65-1   -3.36-1     1     6   0.971   .308
    6       6.83-2   -4.26-2     2     6   1.022   .414
    7       2.65-2   -7.32-3     1     5   0.940   .388
    8       1.10-2   -1.11-3     2     6   1.006   .415
    9       4.39-3   -1.98-4     1     4   0.932   .399
   10       1.83-3   -3.09-5     1     1   0.998   .417
   11       7.40-4   -5.60-6     1     2   0.931   .404
   12       3.12-4   -8.99-7     1     1   0.981   .422
   13       1.25-4   -1.60-7     1     1   0.943   .401
   14       5.32-5   -2.63-8     1     1   0.980   .426

Table 6.4: Performance of scoring: quasi-likelihood case

          (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)
(1)     0.000   -.000   -.000   0.000   0.001   -.001   0.001   0.004   0.007   -.002
(2)     -.001   -.000   -.001   -.001   0.002   -.002   0.015   0.039   -.022   0.061
(3)     -.004   0.002   0.007   0.123   -.037   -.036   -.152   -.110   0.091   -.194
(4)     0.012   -.003   -.014   -.002   -.022   -.046   0.133   0.301   -.176   0.017
(5)     0.033   -.004   0.037   -.043   -.054   0.003   -.021   -.070   -.137   -.083
(6)     -.016   0.033   0.028   -.015   -.044   -.042   -.118   -.362   0.118   0.183
(7)     0.003   -.029   -.001   -.064   0.339   -.058   0.160   -.314   -.033   0.043
(8)     -.047   0.037   -.055   -.169   -.044   0.461   -.028   0.012   0.037   -.090
(9)     -.123   0.040   0.110   -.025   -.247   0.334   -.190   0.033   0.043   -.004

Table 6.5: Residuals by site and variety

G.P.O. Box 4, Canberra
ACT 2601
Australia