The Reliability of Selected Techniques in Clinical Arthrometrics
A number of studies which have examined reliability of spinal assessment procedures in manual therapy are reviewed. The tests examined were Passive Accessory Intervertebral Movements, Passive Physiological Intervertebral Movements, Straight Leg Raise and Forward Flexion. In general, tests of pain were found to be much more reproducible than tests of compliance. Straight Leg Raise and Forward Flexion tests were consistently more reliable than the Passive Intervertebral Movement tests. Possible explanations for these findings are advanced. The role of tests of compliance based on passive intervertebral movements in clinical decision-making may need to be re-examined. An appendix on reliability theory is included for the uninitiated reader.
THOMAS A. MATYAS
Thomas Matyas, B.A.(Hons), Ph.D., is a Senior Lecturer in the School of Behavioural Sciences, Lincoln Institute of Health Sciences, Melbourne.
TIMOTHY M. BACH
Timothy Bach, M.Sc., is Lecturer in Biomechanics in the School of Biological Sciences, Lincoln Institute of Health Sciences, Melbourne.
Manual therapy employs a variety of assessment techniques such as the forward flexion (FF) test, the straight leg raise (SLR) test, passive accessory intervertebral movements (PAIVM) and passive physiological intervertebral movements (PPIVM). Collectively these tests and other similar ones may be taken to define the field of 'clinical arthrometrics'.
Clinical arthrometry provides the basis for a laudably empirical approach to treatment. Among other goals, testing is variously employed to help in the selection of a region for treatment, in the selection of appropriate manual techniques and in monitoring case progress. Clearly, then, the adequacy of the assessment procedures is a major issue in the field. However, inspection of the journal literature to 1980 revealed a remarkable dearth of systematic investigations into the reliability, validity and scaling properties of the clinical assessment procedures employed by manual therapists. Consequently, a research programme was initiated in 1980 with the intention of clarifying some of these issues.
The aim of the present paper is to review several studies whose common theme is the reliability of some techniques in clinical arthrometry. The majority of studies reviewed below are part of a continuing programme of research being carried out at the Lincoln Institute of Health Sciences in conjunction with its postgraduate curriculum. Studies were conducted by postgraduate physiotherapists working under the guidance of experienced clinicians and one or both of the authors.
The paper is organized in five sections. The first section describes a method for measuring forces applied during manual procedures. The second section reviews studies on the reliability of pain measurement with three manual techniques: the PAIVM test, the FF test and the SLR test. The third section reviews studies on assessment of spinal compliance with PAIVM and PPIVM tests. The fourth section describes our studies on the reliability of producing two grades of mobilization described by Maitland (1977). Although these are not studies of assessment techniques, the findings are relevant to those of section three. Section five presents an integrative discussion of the studies performed to date. Each section also attempts to integrate the results of pertinent publications generated outside our programme.
I. An Indirect Method for Estimating Applied Force During Therapeutic Procedures
Studies of the reliability of therapeutic techniques have been limited by a lack of objective measures of therapist performance. While therapist perceptions may be readily obtained, measurement of the mechanical effect of therapeutic intervention is confounded by the requirement that measurement techniques should not interfere with the task. To overcome this restriction, we have developed a method which enables the indirect measurement of forces applied by therapists during mobilization and assessment techniques.
The Australian Journal of Physiotherapy. Vol 31, No 5, 1985 175
The procedure requires that therapists perform their assessment or treatment techniques while standing on a force platform. Figure 1 illustrates the position of the therapist during application of postero-anterior pressure to the lumbar spine of a patient and indicates the three forces acting on the therapist. For this situation we can write:
F + G - W = ma    (1)
where W is the weight of the therapist, F is the reaction to the force applied by the therapist to the patient, G is the ground reaction force measured by the force platform, m is the mass of the therapist, and a is the acceleration of the centre of gravity of the therapist. In order to solve this equation for the applied force, F, values of W, G and a must be known. The ground reaction force, G, is readily obtained from the force platform, as is the body weight, W, when F and a are zero. Techniques are available which enable computation of the acceleration of the centre of gravity, a, but these techniques are too tedious and time consuming for routine application. An alternative approach is to make some assumptions about the behaviour of a during mobilization and assessment techniques.
For some of the experiments reported here these assumptions present little difficulty. If a therapist palpates a point in range and holds that point for a brief period of time (0.5s-1s) while recordings are made, acceleration can be assumed to be virtually zero over this period. For the purposes of this paper, this method will be termed the static force measurement technique. Similarly, if a therapist performs oscillatory mobilizations and force platform data is sampled over a much longer period of time (20 or more oscillations), the average acceleration over the sampling period will be virtually zero (otherwise the therapist would acquire a net positive or negative velocity). The difference between body weight and the measured ground reaction force is an accurate estimate of mean applied force in both cases.
Figure 1: The forces which act on a therapist performing spinal mobilization or palpation are body weight, W; the ground reaction force, G; and the reaction to the force applied to the patient, F.
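As an illustration of the static and averaged cases, Equation 1 can be rearranged to F = W - G + ma, which reduces to F = W - G when the (mean) acceleration is taken as zero. The following is a minimal sketch of that calculation only; the function name, the sample readings and the 700 N body weight are hypothetical and are not taken from the studies.

```python
from statistics import mean


def mean_applied_force(ground_reaction_n, body_weight_n):
    """Estimate mean applied force (N) as W - mean(G).

    Assumes the mean acceleration of the therapist's centre of gravity
    is zero over the sampling window, so the inertial term ma drops out.
    """
    if not ground_reaction_n:
        raise ValueError("need at least one force platform sample")
    return body_weight_n - mean(ground_reaction_n)


# Hypothetical readings (N) taken while a palpation point is held still.
samples = [652.0, 649.5, 650.5, 648.0]
body_weight = 700.0  # hypothetical platform reading with F = 0 and a = 0

print(mean_applied_force(samples, body_weight))
```

The same function applies to oscillatory mobilization if the samples span many complete oscillations, since only the average of the ground reaction force is used.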
In other experiments considered here, estimates of oscillation amplitude and peak applied force were required. The method employed in these studies will be termed dynamic force measurement. Instantaneous values of applied force are much more susceptible to inertial effects than average values. Bach (1985) has adopted an empirical approach to estimate the degree of error involved in using the force platform output (estimated force) as an indirect measure of applied force under different conditions of movement amplitude and frequency. Bach (1985) found that the error associated with oscillation amplitude measurement by this technique was approximately 12%. The error of estimating peak forces by this technique was in the neighborhood of 1-3% depending on characteristics of the applied forces.
In the studies reviewed in this chapter we have measured only the vertical component of force. Many assessment and treatment techniques require that force components other than vertical be applied. However, studies described here concentrated on postero-anterior central vertebral pressures on prone patients and therefore primarily vertical forces were involved. In one experiment (Collis-Brown 1982) involving 192 measurements of applied force during a postero-anterior PAIVM assessment the mean difference between the vertical component of the applied force and the total applied force was 1.8N. This represented 0.5% of the total range of measured forces. We have therefore chosen to neglect horizontal components of the applied force in techniques involving primarily postero-anterior movements.
An unresolved issue is that of the pressure distribution between therapist and patient. In studies of applied force reported here therapists were required to use the pisiform technique as described by Maitland (1977, p.137). This technique involves placing the hands so that the point of contact with the spinous process is the medial border of the hand between the pisiform and the hamate. The purpose of this placement is to localize the pressure distribution as much as possible. The proportion of total applied force which acts on the vertebral body itself could differ between therapists and between patients as a result of anatomical variation in soft tissue distribution in both the hands of therapists and the backs of patients. To our knowledge, there is no method available for obtaining precise information on these pressure distribution patterns but differences are likely to be very small. Furthermore, these errors are fixed by the experimental designs employed: the therapists' hands do not change; the individuals tested and retested are the same; the anatomical loci are the same. Thus only absolute force values will be subject to pressure-distribution error. Reliability coefficients, which are only affected by random error, will not be influenced (see Appendix).
II The Reliability of Some Movement Tests of Pain in the Lumbar Spine
Tests of pain may employ either passive movements as in PAIVM, PPIVM and SLR tests or active movements as in the FF test. These tests are employed to chart 'pain behaviour' (Maitland 1977). Although other features, such as 'quality' of pain, may also pertain, 'pain behaviour' is often conceived as a two dimensional function: pain versus range of movement (ROM). Key features of this function are: the point in ROM of pain onset (P1); the pain intensity at the limit of movement (when the limit is caused by factors other than pain), or the point in ROM where pain is of sufficient intensity to limit movement (P2); and the dynamics of pain intensity between P1 and P2, i.e. the nature of the change in pain intensity as a function of ROM. Pain assessment is an essential feature of initial diagnosis, acute pre-post evaluation of manual intervention and longer term evaluation of intersession development. Therefore intertherapist reliability, within-session test-retest reliability and between-session test-retest reliability are all relevant practical issues for evaluating P1, P2 and pain dynamics. Our studies to date have examined only some of these issues.
Results with the PAIVM test
Collis-Brown (1982) and McNeill (1982) examined in a within-session design the test-retest and intertherapist reliability of locating P1 in ROM when using PAIVM. Four physiotherapists with postgraduate qualifications in manual therapy examined two segments from each of 12 patients. Patients were included if prior examination revealed: a history of back pain or current back pain; a non-irritable condition; and discernible pain onset in at least two lumbar levels on application of PAIVM. Patients were examined in prone with the two relevant lumbar levels pre-marked. As much of the upper and lower body was covered as was possible in order to reduce body identity. No communication was permitted other than the response 'Now' to the question 'Tell me when the pain starts'. The static force measurement technique described earlier was used to measure applied forces. Therapists recorded their conclusion on a 100mm visual analogue scale (VAS). This permitted simultaneous measurement of the force at which P1 occurred and the subjective distance from ROM origin where P1 occurred according to the therapist. To control for series effects therapists examined the patients in a Latin square design (Meyers and Grossen 1974), with patients randomly allocated to four groups of three. After three patients were examined by all therapists the entire procedure was repeated. The experimental design therefore provided 24 test and 24 retest measurements from each of four therapists under conditions which attempted to minimize information other than the P1 response to PAIVM.
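A Latin square ordering of the kind mentioned above can be sketched as follows. This is a generic cyclic construction for four therapists and four patient groups, offered only as an illustration; the therapist labels are hypothetical, and the actual square and randomization used in the study are not described in the text.

```python
def cyclic_latin_square(labels):
    """Return an n x n Latin square built by cyclically shifting the labels.

    Row = patient group, column = serial position in the examination order.
    Each label appears exactly once in every row and every column.
    """
    n = len(labels)
    return [[labels[(row + col) % n] for col in range(n)] for row in range(n)]


therapists = ["T1", "T2", "T3", "T4"]  # hypothetical labels
square = cyclic_latin_square(therapists)

for group, order in enumerate(square, start=1):
    print(f"patient group {group}: examined in the order {order}")
```

With this layout every therapist examines every patient group, and each therapist occupies each serial position once, which is the property that controls for series effects.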
Collis-Brown (1982) found that the average test-retest reliability coefficient for palpation conclusions was 0.73. This was only a little less than the average correlation between the test and retest forces required to produce a P1 response, which reached a value of 0.83. The difference between the two coefficients was not statistically significant. In terms of classical reliability theory this implies that 27% of the variance in P1 observed by palpation was due to random error. This error may be conceived as a composite result of at least two processes: random changes in the patient's pain condition, or in the verbal report; and random error in the therapist's ability to perceive the point in ROM where P1 was reported and record it on the VAS. An estimate of the first component may be obtained from the test-retest correlation of the forces, which do not depend on therapist perception and recording ability. This method estimates that 17% of the observed score variance was due to random error in the patient's report, although the true value will be somewhat lower because an amount should be allowed for the random error in force measurement. Nevertheless, it was apparent that random error due to therapist perception and recording was small.
From a practical point of view, however, random error destroys judgement reliability irrespective of its genesis in the patient or the therapist. To make interpretation of patient changes clearer, confidence intervals were computed for the therapist judgements. These estimated that for 95% confidence that a change does not reflect merely random error, a therapist must observe a change of at least 34% of full scale on the VAS. In clinical situations confidence as low as 80% may sometimes suffice. This was estimated to require a change of at least 22% on the VAS. It is difficult, given the lack of evidence on the size of the effect requiring measurement, to decide if the random error is sufficiently small.
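The relation between the 95% and 80% thresholds can be illustrated by rescaling one threshold to another confidence level. The sketch below assumes normally distributed random error and a two-sided interval; the method actually used for the published figures is not given in this section, so this is an approximation under stated assumptions, and the function name is illustrative.

```python
from statistics import NormalDist


def rescale_threshold(threshold, conf_from, conf_to):
    """Rescale a minimum-change threshold between two-sided confidence levels.

    Assumes the threshold is z * (standard error of a difference), with z
    the standard normal quantile for the stated two-sided confidence.
    """
    z_from = NormalDist().inv_cdf(0.5 + conf_from / 2.0)
    z_to = NormalDist().inv_cdf(0.5 + conf_to / 2.0)
    return threshold * z_to / z_from


# 34% of full scale at 95% confidence, rescaled to 80% confidence.
print(round(rescale_threshold(34.0, 0.95, 0.80), 1))
```

Under these assumptions the rescaled value is close to the 22% of full scale quoted for 80% confidence.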
McNeill (1982) examined the degree of intertherapist reliability present in the above experiment. The average intertherapist correlation was 0.62. This indicates that a substantial proportion of the variance in observed scores (38%) was attributable to intertherapist variation in performing the test. The intertherapist correlation in forces required to produce P1 was 0.75, which was not significantly lower than the intratherapist value of 0.83. A large portion of the variability in intertherapist correlations was attributable to random error in patient report (25%) and a smaller portion to differences between therapists (13%).
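The variance shares quoted for the Collis-Brown and McNeill analyses follow from treating one minus a reliability coefficient as the proportion of observed variance due to random error. A minimal sketch of that bookkeeping is given below; the function names are illustrative, and any random error in the force measurement itself is lumped into the patient-report share, as the text notes.

```python
def error_share(reliability):
    """Proportion of observed-score variance attributed to random error."""
    return 1.0 - reliability


def partition(judgement_r, force_r):
    """Split judgement error into patient-report and therapist components.

    judgement_r: correlation between therapist judgements (VAS).
    force_r: correlation between the forces needed to elicit P1, which
             does not depend on therapist perception or recording.
    """
    total = error_share(judgement_r)
    patient = error_share(force_r)
    return {"total": total, "patient_report": patient, "therapist": total - patient}


# Test-retest (Collis-Brown) and intertherapist (McNeill) coefficients.
for label, r_judge, r_force in [("test-retest", 0.73, 0.83), ("intertherapist", 0.62, 0.75)]:
    shares = {k: round(v, 2) for k, v in partition(r_judge, r_force).items()}
    print(label, shares)
```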
The effect of conducting a broader PAIVM test, including compliance features, spasm, and a complete chart of 'pain behaviour', was investigated subsequently by Flint (1983). Four manual therapists with postgraduate qualifications independently examined one lumbar level from each of twelve patients. The patients were selected from several clinics providing that a screening physiotherapist identified, following a full examination (Maitland 1977), current back pain attributable to the lumbar region. The patients were examined in a Latin square sequence as in the earlier study. The movement diagram described by Maitland (1977) was employed as a two dimensional VAS of Intensity x ROM (67 x 100mm). Therapists recorded P1, P2, the dynamics of pain between P1 and P2, as well as the other features typically required by a Maitland movement diagram: the limit of range (L); the point in ROM of resistance onset (R1); limiting resistance (R2); the dynamics of resistance between R1 and R2; and the behaviour of muscle spasm, if present (Maitland 1977). The screening physiotherapist premarked the level to be tested, which was the 'most symptomatic' level found in the prior examination. Therapists were required to palpate only the marked level using central PAIVM. No other patient information was given to the therapists.
Flint (1983) found that the mean intertherapist correlation for locating P1 in ROM was 0.48, somewhat lower than the 0.62 obtained by McNeill (1982). Although this difference is not statistically significant, the result indicates that additional palpation information failed to improve the reliability of P1 ratings.
Furthermore, Flint's sample had a more acute status than that employed by McNeill. Thus the result also failed to support the hypothesis that P1 ratings from more acute patients would provide better reliability because acute patients are likely to have a clearer pain onset, with a distinct 'bite of pain' (Maitland 1977, Collis-Brown 1982, McNeill 1982).
As a part of the same study, Flint also examined intertherapist reliability in measuring pain intensity at P2. She found a mean intertherapist reliability coefficient of 0.75, a relatively good result and the best intertherapist reliability coefficient obtained to date in our PAIVM investigations. It is interesting to note that this feature of the movement diagram is probably more reliant on the patient's response and less reliant on the therapist's ability than any other PAIVM finding.
A final aspect of the reliability of pain assessment investigated by Flint was the degree of intertherapist agreement on whether pain, spasm or resistance was the cause of movement limitation. The mean pairwise agreement was 66.6% which proved significantly higher than the expected random agreement rate (51.8%) given the obtained base rates. Nevertheless, an intertherapist disagreement rate of 32.4% is substantial in a practical sense, since the decision about the cause of movement limitation plays a significant role in selecting treatment approach (Maitland 1977).
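To show how an expected random agreement rate arises from base rates, and how far the observed agreement exceeds it, the following sketch may help. The base rates in the example are hypothetical (the study's base rates are not reported here), the calculation assumes both raters share the same base rates, and the chance-corrected index (a kappa-type statistic) is an addition for illustration that is not reported in the text.

```python
def chance_agreement(base_rates):
    """Expected agreement if two raters choose categories independently
    with the same base rates: the sum of squared category proportions."""
    if abs(sum(base_rates) - 1.0) > 1e-9:
        raise ValueError("base rates must sum to 1")
    return sum(p * p for p in base_rates)


def chance_corrected(observed, expected):
    """Agreement beyond chance as a fraction of the maximum possible."""
    return (observed - expected) / (1.0 - expected)


# Hypothetical base rates for limitation by pain, resistance and spasm.
print(round(chance_agreement([0.6, 0.3, 0.1]), 2))

# Quoted figures: 66.6% observed agreement, 51.8% expected by chance.
print(round(chance_corrected(0.666, 0.518), 2))
```

On the quoted figures the agreement beyond chance is only about a third of what is attainable, which is consistent with the caution expressed above about the practical size of the disagreement.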
Results with the SLR test
The SLR is a widely used test, recommended (Cyriax 1982) for both diagnosis and progress evaluation. It is associated with a considerable body of literature discussing its underlying processes (Goddard and Reid 1965, DePalma and Rothman 1970, Murphy 1977, Breig and Troup 1979 and Cyriax 1982). Like the PAIVM test it employs passive movement, but the movement is 'physiological' rather than 'accessory'.
McFarlane (1981) examined the reliability of assessing pain onset as a point in ROM during the SLR. Twenty patients with low back pain of recent origin were selected from several Melbourne hospitals provided that they did not show an unusually high anxiety component, or failed to show a change in symptoms under 80° of SLR, or showed restricted movement or pain in the squatting test. Five SLR tests to pain onset were performed on each subject with a 90 second inter-test interval. A gravitational goniometer was used to record the angle at P1. Medial hip rotation was manually controlled as suggested by Breig and Troup (1979).
A mean test-retest correlation of 0.96 was found between adjacent pairs of trials, indicating a very high reliability for this test. On the basis of McFarlane's data we calculated that a change of at least 13.6° should be observed in P1 if changes due to random error are to be excluded with a certainty of 95%. If typical normal ROM is estimated around 90° (DePalma and Rothman 1970, Cyriax 1982) the 95% confidence interval for test-retest change is 15% of scale, which is better than the 34% obtained with the PAIVM test (Collis-Brown 1982). Thus both metric and metric-free estimates of reliability show better values for the SLR test.
In addition, McFarlane examined the possibility that systematic trends may occur in the SLR data. She found that the range to pain onset increased between successive tests by an average of 1.2° which was a statistically significant effect. Therefore increases in range to pain onset of 15° would probably be safer minima for error free estimates of therapeutic improvement between succeeding tests obtained within session.
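The arithmetic behind the SLR thresholds can be laid out briefly. The sketch below expresses the 13.6° threshold as a percentage of an assumed 90° scale and then adds the 1.2° systematic drift; treating the 15° recommendation as the sum of these two terms is one reading of the text, not a stated derivation, and the function names are illustrative.

```python
def percent_of_scale(threshold, full_scale):
    """Express a minimum-change threshold as a percentage of full scale."""
    return 100.0 * threshold / full_scale


def drift_adjusted_threshold(random_error_threshold, drift_per_test):
    """Add the average systematic increase between successive tests
    to the threshold derived from random error alone."""
    return random_error_threshold + drift_per_test


# 13.6 degrees against an assumed normal SLR range of 90 degrees.
print(round(percent_of_scale(13.6, 90.0), 1))

# 13.6 degrees plus the 1.2 degree average increase between tests.
print(round(drift_adjusted_threshold(13.6, 1.2), 1))
```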
A subsequent experiment performed by Puentedura (1983) to examine the effects of trunk position on the SLR test indirectly yielded confirmatory evidence of high reliability for this test. Puentedura recorded pain onset and limiting pain in seventeen young, non-symptomatic subjects who reported no history of chronic musculoskeletal illness. Electrogoniometric readings were obtained with the trunk in three positions: neutral, maximal contralateral flexion and supported lumbar lordosis. All tests were performed in supine on a flat surface. Within each posture ten pain onset and two limiting pain observations were performed. However, since repeated measures within each posture were obtained with no intervening treatment, we were able to re-examine Puentedura's raw data for test-retest reliability coefficients. Regardless of posture these proved to be uniformly high. The mean test-retest correlation between adjacent pain onset trials within a posture was 0.98. The limiting pain data yielded an average correlation of 0.96. These results confirm and extend those of McFarlane.
In the period between the studies of McFarlane and Puentedura three publications appeared (Hoehler et al 1982, Lankhorst et al 1982, Million et al 1982) which seem to further confirm the high reliability of the SLR test. Million et al (1982) found a within session retest reliability of 0.97 using nineteen patients. Lankhorst et al (1982) using an active SLR reported error components for both interobserver and interday aspects from a factorial design applied to 48 low backache patients. From their results we calculated an interday test-retest reliability of 0.96-0.97 and interobserver reliability of 0.93-0.96. Slightly poorer results were found by Hoehler and Tobis (1982) for interobserver reliability when measuring passive SLR (r = 0.78), although for active SLR the results were comparable (r = 0.95).
Results with the FF test
Like the SLR test, the FF test involves 'physiological' movement. However the test is one of active rather than passive movement. During active forward bending in the sagittal plane, with the knees extended, several parameters may be recorded. These include ROM to pain onset (P1) and ROM to maximum pain tolerance (P2) among others. The test is widely used as a part of various approaches to examination of the lumbar spine (Maitland 1977, Stoddard 1980, Cyriax 1982). The purpose of this subsection is to review four studies our group performed on the reliability of the FF test for measuring pain parameters.
Several methods for recording ROM during FF tests have been reported including skin distraction (Macrae and Wright 1969, Van Adrichem and Van Der Korst 1973), spondylometry (Twomey and Taylor 1979, Stoddard 1980), inclinometry (Loebl 1967), tangential hydrogoniometry (Anderson and Sweetman 1975), radiography (Hauley et al 1976) and photography (Troup et al 1967). Some of the previous literature investigating the adequacy of these measurement methods has been concerned with their relative value for assessing spinal mobility (Troup et al 1967, Van Adrichem and Van Der Korst 1973, Reynolds 1975, Moran et al 1979). Much of the evidence has been collected from normal samples (Loebl 1967, Troup et al 1967, Van Adrichem and Van Der Korst 1973, Reynolds 1975, Moran et al 1979). The purpose of the studies reported below was to examine pain measurement with a view to clinical application. Therefore simplicity was a criterion for selecting the approach to measuring ROM. This excluded radiographic and photographic methods.
The method adopted was to measure fingertip position using a measuring tape (Kapanji 1974). Apart from its simplicity this method seemed appropriate because kinesiological analysis suggests that it is influenced not only by spinal movement, but also by hip movement and a variety of associated structures including muscle and connective tissue (Farfan 1973, Van Adrichem and Van Der Korst 1973, Hart et al 1974). While this is a disadvantage for the assessment of specific mobility in the lumbar spine (Moll and Wright 1976), it may be an advantage in the measurement of pain, particularly pain progress, where a variety of structures may be implicated.
Kwong (1981) investigated the test-retest reliability of assessing P1 with the FF test. Twenty patients attending a physiotherapy clinic were sampled provided they had low back pain without either hip involvement, a list, or scoliosis. Patients were assessed in briefs and bare feet after the prominence of the tibial tuberosity was marked. They were required to bend forward, sliding their hands down their thighs, without deviation from the sagittal plane, until pain onset. Patients with pre-existing background pain were instructed to stop on onset of a pain change. Using the tibial tuberosity mark as origin, ROM to P1 was recorded by measuring the distance to the tip of the midfinger with a nylon tape. Three measurements with an intertrial interval of one minute were obtained from each patient. The mean test-retest correlation between trial pairs was 0.98, indicating very high reliability. Using Kwong's data we calculated that changes of 83mm or more would give 95% confidence that the observed change was not the result of random error of measurement.
Systematic error due to repeated measurement was assessed by comparing the central tendency in the three samples. No statistically significant differences were obtained, although there was a suggestion that an initial practice trial might stabilize the data.
Using the same FF measurement technique, Bruce (1981) investigated the test-retest reliability of assessing the point of limiting pain (P2). Twenty patients with low back pain were selected from a private physiotherapy clinic provided they were not restricted by bilateral hamstring tension, or had less than 60° ROM, or had an 'irritable condition'. Patients were randomly allocated to two groups of ten. Three measurements were taken from all subjects. An objective examination of the spine (Maitland 1977) was interpolated in one group between the first and second measure and in the other group between the second and the third measure. The other intertrial intervals were three minute rests. Test-retest correlation, when only rest intervened between the two trials, was 0.98. This was consistent with Kwong's results. Test-retest correlations between trials separated by the objective spinal assessment were 0.87 and 0.99. Using the reliability coefficient of 0.98 and Bruce's raw data we calculated that changes of 33mm or more would give 95% confidence that the observed change was not the result of random error of measurement. No systematic bias due to repeated measurement was found, replicating Kwong's data.
The studies performed by Bruce and Kwong were limited to assessing within-session retest reliability. While evaluation of within-session progress is a main use of assessment in manual therapy, the results of Bruce and Kwong are not necessarily generalizable to between-session retest intervals. Therefore Patterson (1982) examined limiting pain in FF on two consecutive days. Three FF tests were conducted on Day 1 separated by one minute rest intervals. The procedure was repeated on Day 2. A sample of 12 subacute or chronic low back pain patients was selected, using similar criteria to those of Bruce and Kwong. The mean within-session retest reliability was found to be 0.98, confirming the findings obtained by Bruce. The mean between-session retest reliability was 0.97, not significantly lower than that obtained within-session. The 95% confidence interval for measuring changes within-session was 45mm, slightly higher than that obtained by Bruce. The 95% confidence interval for measuring changes between days was 52mm.
Maitland (1977, p.171) recommends that a therapeutic effect should only be assumed if an improvement of 25mm or more in limiting pain is obtained. This conclusion, based on clinical observation and in the absence of formal analysis, compares well to our experimental estimates. In terms of the random error estimate obtained by Bruce, changes in excess of 25mm afford 87% confidence. In terms of Patterson's within-session estimates 25mm changes afford 76% confidence. Between-session conclusions should be taken even more conservatively: our calculations based on Patterson's data estimate only 68% confidence for a minimum change of 25mm.
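The confidence afforded by a 25mm change can be approximated by rescaling each 95% threshold. The sketch below assumes a two-sided Student t interval with n - 1 degrees of freedom (n = 20 for Bruce, n = 12 for Patterson); the text does not state which distribution or degrees of freedom were used, so both are assumptions, and the function names are illustrative. The t distribution is integrated numerically so that only the standard library is required.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_coverage(t, df, steps=1000):
    """P(|T| <= t), using Simpson's rule on [0, t] (steps must be even)."""
    if t <= 0:
        return 0.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2.0 * total * h / 3.0


def critical_t(conf, df):
    """Two-sided critical value found by bisection on the coverage."""
    lo, hi = 0.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if two_sided_coverage(mid, df) < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def confidence_for_change(change, threshold_95, n):
    """Confidence that a change of the given size exceeds random error,
    given the 95% threshold and an assumed sample size n."""
    df = n - 1
    se_diff = threshold_95 / critical_t(0.95, df)
    return two_sided_coverage(change / se_diff, df)


for label, threshold, n in [("Bruce", 33.0, 20),
                            ("Patterson within-session", 45.0, 12),
                            ("Patterson between-session", 52.0, 12)]:
    print(label, round(100 * confidence_for_change(25.0, threshold, n)), "%")
```

Under these assumptions the three results fall close to the quoted 87%, 76% and 68%.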
Another possibility for measurement error on reassessment is that repeated exposure to the same test may create a systematic bias. Serial effects may occur as a result of changes in the relevant anatomy/physiology caused by the initial test, placebo phenomena, or simply skill learning. In Kwong's study no statistically significant differences were obtained among the three trials. The largest mean difference was only 6mm and occurred between the samples of trials 1 and 2.
The most recent study in this series was designed to determine if the high reliability found in the three previous studies was an artefact of the way the test was performed. At least two obvious hypotheses might be invoked to suggest that the high retest reproducibility resulted from factors other than pain sensation. One hypothesis is that visual and tactile feedback was available to patients in these studies since they could see their own performance and the test procedure required the hands to slide down the legs. Another hypothesis, more difficult to test, is that the high reproducibility merely represents memory for movement rather than recurrence of a given pain level at the same point in ROM.
To investigate these hypotheses Munro (1983) examined the within-session retest reliability on a modified FF test. The test was performed with a blindfold. Furthermore, instead of sliding their hands down their legs, subjects were required to bend forwards while depressing a low-friction plunger vertically with the tips of the middle fingers (Moll and Wright 1976). The plunger was part of an apparatus containing a metric scale and pointer which permitted location of movement endpoint to the nearest millimetre. A final modification to the previous procedure was that a simple motor task (manipulation of a nut and bolt) was interpolated between the test and the retest in an effort to produce some disruption in sensorimotor memory. Two groups of subjects were tested. The first group comprised 17 low back pain patients selected along criteria similar to those of the earlier studies. The second group comprised 17 asymptomatic subjects. Subjects were selected in the asymptomatic group on a matched pair basis with a low back pain subject. The matching criteria were gender and age parity (within 6 years). The asymptomatic member of each matched pair was required to perform a task yoked to the initial performance of the low back pain subject. A mechanical block, placed at the same point where the symptomatic subject showed pain onset, was used to stop forward bending of the asymptomatic subject during the test. During the retest, which followed the interpolated task, the block was not present and asymptomatic subjects were required to simply stop at the point as they recalled it from the initial test. Symptomatic patients were required to stop on pain onset during both tests.
Despite the blindfold and the interpolated task, symptomatic subjects showed a test-retest correlation of 0.99. Statistical analysis revealed that this was significantly higher than the correlation shown by the asymptomatic group (0.92). The high reliability obtained confirmed the earlier FF data (Bruce 1981, Kwong 1981, Patterson 1982). More importantly, however, the superior reliability of the symptomatic group under these stringent performance requirements suggests that pain sensation was contributing, rather than visual or tactile feedback. Similarly, performance on memory alone can be rejected, although a more convincing demonstration could probably have been obtained by employing a longer intertest interval and an interpolated task using the same joints as the FF test, but which does not aggravate the pain.
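One common way to compare two correlation coefficients is Fisher's z transformation, sketched below for the two groups of 17 subjects. The text does not say which test was used, and the groups were matched pairs rather than strictly independent samples, so this is an illustrative approximation under the assumption of independence; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher z test for the difference between two correlations.

    Returns the z statistic and the two-sided p value, treating the
    two samples as independent.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Symptomatic group r = 0.99, asymptomatic group r = 0.92, 17 subjects each.
z, p = compare_independent_correlations(0.99, 17, 0.92, 17)
print(round(z, 2), round(p, 3))
```

On this approximation the difference exceeds the conventional 5% criterion, in line with the significant difference reported above.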
180 The Australian Journal of PhYSiotherapy. Vol. 31, No.5, 1985
In conclusion, our studies of the FF test for pain have consistently produced high reliability estimates and suggest that pain is indeed being accessed. This finding is in contrast to the deprecatory conclusions of some other authors (Hart et al 1974, Reynolds 1975, Moll and Wright 1976, Moran et al 1979). The FF test is said not to be a good measure of spinal mobility (Hart et al 1974, Reynolds 1975, Moll and Wright 1976). This may be so but the point is irrelevant to the measurement of pain and its progress. The FF test is said to be influenced by structures other than those of the lumbar spine (Reynolds 1975, Moran et al 1979). We have already addressed this issue, indicating that from the point of view of monitoring pain progress this may be an advantage. In general, therefore, results indicate that the FF test should not be overlooked as a simple and reliable clinical test for assessing pain changes, particularly if other aspects of the assessment have established the nature of the underlying pain process. Finally, it is interesting to note that the reliability coefficients of the SLR and FF tests, both of which involve 'physiological movements', were comparable and consistently higher than those obtained for PAIVM tests of pain.
III The Reliability of Some Clinical Procedures for Assessing Compliance
Manual tests of spinal compliance probably form the most characteristically unique contribution of manual therapy to the diagnostic armamentarium. Their objective is to employ the therapist's perception of displacement and 'resistance' to obtain a subjective model of spinal compliance, which can be used for a variety of decisions (Maitland 1977). That this involves a perceptual model of spinal compliance, including dynamic parameters, can be seen most clearly in the development of the two-dimensional movement diagram (Maitland 1977). Manual assessment of compliance contains, in counterpart to pain assessment, some key parameters: the point in ROM of resistance onset (R1); the point in ROM where resistance limits passive movement (R2); and the compliance function which links R1 and R2. Compliance tests have a role in: initial diagnosis, including selection of the level to be treated and type of mobilization to be utilized; the evaluation of progress within-session following treatment; and progress between sessions (Maitland 1977).
Consequently, test-retest and intertherapist reliability are relevant issues. The majority of our studies to date have been concerned with PAIVM (Baker 1981, Millman 1981, Wong 1981, Weeks 1982, Allen 1983, Flint 1983), although one study involving PPIVM (Clarkson 1982) is also reported below.
Studies evaluating the reliability of R1 and R2 assessment with PAIVM
Despite the relatively widespread use of the PAIVM assessment procedures described by Maitland (1977), a review of the literature prior to 1981, the time of our group's initial study (Baker 1981, Wong 1981), revealed a remarkable dearth of systematic attempts to evaluate the reliability of these procedures.
An initial study designed to estimate intertherapist reliability for locating R1 and R2 in ROM was conducted by Baker (1981) and Wong (1981). Three therapists independently examined six spinal levels from each of eighteen subjects. The subjects had an age range of 18 to 54 and no history of recent spinal pain. The six levels examined were C2, C6, T2, T10, L2 and L4. The three cephalad processes were examined with thumbs in apposition. The three caudad processes were examined with pisiform technique. Each therapist was required to mark R1, R2 and the compliance function linking them on a 45 x 60 mm movement diagram (Maitland 1977). The ROM to R1 and to R2 was then obtained to the nearest millimetre. Using these measures, intertherapist correlation coefficients for each joint were then obtained for each pairwise combination of therapists.
Intertherapist correlations were lower than those obtained in PAIVM tests of pain. The mean coefficient for R1 across all spinal levels was 0.30. The best mean correlation for a single level was 0.64, obtained from L4. This was significantly superior to the other coefficients obtained. The mean correlation for R2 across all spinal levels was 0.28 and the best mean correlation for a single level was 0.58, obtained from L2. The L2 value was significantly superior to that of C6, T2 and T10. Other differences between the reliabilities given by the six levels were not statistically significant. Although the mean reliability coefficients of 0.30 and 0.28 were statistically significant, they were disappointingly low.
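The pairwise procedure described above can be sketched as follows for a single spinal level. The data are hypothetical R1 distances invented for illustration, and the function names are assumptions rather than anything taken from the original studies.

```python
from itertools import combinations
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_pairwise_correlation(ratings):
    """ratings maps therapist -> list of measurements in the same subject order.

    Returns the mean correlation over all therapist pairs, plus each pair's value.
    """
    pairs = list(combinations(sorted(ratings), 2))
    by_pair = {pair: pearson(ratings[pair[0]], ratings[pair[1]]) for pair in pairs}
    return sum(by_pair.values()) / len(by_pair), by_pair

# Hypothetical R1 distances (mm) marked by three therapists for six subjects.
ratings = {
    "A": [10, 20, 30, 40, 50, 25],
    "B": [12, 18, 33, 35, 52, 20],
    "C": [30, 15, 40, 22, 28, 35],
}
mean_r, by_pair = mean_pairwise_correlation(ratings)
for pair, r in by_pair.items():
    print(pair, round(r, 2))
print("mean pairwise r:", round(mean_r, 2))
```

With three therapists there are three pairs; the mean coefficients quoted in the text are averages of this kind, taken per level and then across levels.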
In a subsequent study Weeks (1982) examined the within-session and interweek test-retest reliability for locating R1. Four therapists independently examined three joints from each of twelve subjects. None of the subjects had a history of recent spinal pain. The age range was 20-50 years. Each therapist palpated C2, T4 and L5 on two occasions one week apart. Within each session the joints were assessed twice on a rotational basis across the twelve subjects, ie the examination of eleven subjects intervened between the first and second assessments within the session. Therapists were required to mark the location of R1 on an 80 mm VAS marked in quarters. Apart from the areas to be examined, subjects' bodies were draped.
Distances to R1 were then used to compute, for each segment, the within-session and interweek reliability coefficients for each therapist. The within-session correlation was 0.46 when averaged across all four therapists and all three joints. The interweek reliability coefficient averaged an extremely poor 0.09, which was significantly worse than even the disappointingly low within-session correlation.
Since the four therapists examined the same subjects it was also possible to replicate the estimate of intertherapist reliability obtained by Wong (1981). Over both days and across all joints the mean pairwise intertherapist correlation was 0.25, confirming the low estimate obtained by Wong.
In general, therefore, the two studies indicated that PAIVM assessment of compliance parameters has poor reliability. However, these estimates should be interpreted in the light of two methodological issues which weaken the generalizability of the estimates. The first issue is that in both studies the sample comprised trainee therapists in the second half of the postgraduate diploma specializing in manual therapy. It is possible to argue that such a sample may not have been representative of the ability which a sample of fully trained and more experienced practitioners would demonstrate.
The second issue of generalizability refers to the sample of subjects used by the two studies, which in both cases had no recent history of spinal pain, unlike the subjects typically seen in clinical practice. The quantification of reliability is affected to a degree by the range of individual differences among the joints examined. The mathematical theory of reliability clearly indicates that restricting the range of variation will tend to reduce the reliability coefficient (see Appendix). The reliability coefficient is the ratio of the true-score variance to total (true-score plus error) variance. The size of the error variance may be assumed to remain constant over the sample as a whole when the same method of measurement is employed. However, if the true-score variance is reduced because true individual differences between the measured entities have been reduced, then the random error component will be a larger proportion of the total variation and the overall correlation coefficient will be reduced. In other words, if the range of variation in compliance parameters which results from individual differences in a non-clinical sample is substantially different from the range obtained in clinical samples, the reliability estimates obtained will tend to be biased. The issue being an empirical one, the logical approach is to examine a clinical sample. Flint (1983), whose results have been reported in part above, chose that approach.
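The range-restriction argument can be made concrete with a minimal sketch of the variance-ratio definition given above. The variance values are hypothetical and chosen only to show the direction of the effect.

```python
def reliability(true_score_variance, error_variance):
    """Reliability as the ratio of true-score variance to total variance."""
    return true_score_variance / (true_score_variance + error_variance)

# Hypothetical figures: the measurement error stays constant while the spread
# of true individual differences shrinks (e.g. a homogeneous non-clinical sample).
ERROR_VARIANCE = 4.0
for true_variance in (16.0, 4.0, 1.0):
    r = reliability(true_variance, ERROR_VARIANCE)
    print(f"true-score variance {true_variance:>4}: reliability = {r:.2f}")
```

With the error variance held at 4, the coefficient falls from 0.80 to 0.50 to 0.20 as the true-score variance drops from 16 to 4 to 1, although the measurement method itself has not changed.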
The study carried out by Flint, in contrast to those of Weeks and Wong, employed a clinical sample; gave therapists an 'ecologically valid' assessment task, since they were required to do a full pain and passive movement diagram on a clinical subject; and used four fully qualified therapists with post-qualification experience ranging from nine months to three years. The intertherapist reliability coefficient for locating R1 in ROM was found to be 0.38 on the average, which is not significantly higher, in either the statistical or the practical sense, than that of 0.30 reported by Wong.
The reliability of differentiating spinal levels on the basis of compliance perception following PAIVMs
In clinical assessment PAIVM tests may be used in the attempt to locate compliance parameters on a perceptual ratio scale so that they may be used to guide diagnosis, assess progress and assist in the selection of grades of therapeutic movement typical of the approach described by Maitland (1977). This purpose guided the orientation of the studies reported in the previous subsection. An alternative purpose for PAIVM tests is to assess the presence of compliance abnormalities by palpation, on a comparative basis, across several spinal levels. Relevant parameters include 'end feel', soft tissue resistance and postero-anterior amplitude of joint movement (Maitland 1977).
Millman (1981) examined test-retest and intertherapist reliability for blind discrimination of the stiffest spinal level. Therapists were blindfolded and required to select, only by performing PAIVM with pisiform, which of the six unidentified levels presented in random sequence was stiffest. The levels included were L4 to T11.
Therapists were permitted to repalpate any levels they were uncertain about until they came to a firm decision. Each of three therapists examined the same thirteen nonclinical subjects on two occasions within one session of testing. The results indicated that preconceptions about anatomical variation in stiffness were adequately controlled by this procedure because therapists' ability to identify which anatomical levels they were on was not significantly better than that attainable by chance. Furthermore, therapists were unable to guess at better than chance rates when they were performing a retest.
Under these conditions, which imposed a strict dependence on palpatory information, the mean test-retest agreement rate was 31%. Statistically, this was significantly better than the agreement rate of 16.7% predicted by a model which assumed that therapists were randomly selecting one level among six. The analysis also showed that 31% was significantly worse than the agreement rate of 50% predicted by a model which assumed that therapists were able to reject four levels with certainty, but were guessing which of the remaining two levels was stiffest. The best model was that which assumed therapists were able to reject three of the levels but guessed among the remaining three. These models are, of course, imaginary. They should not be taken to imply that therapists decide literally following the processes assumed by these models. The models do however provide a valuable frame of reference for interpretation.
The analysis of intertherapist agreement showed that the average pairwise agreement was 25.7%. This was significantly better than the 16.7% predicted by a model assuming complete guessing. It was also significantly worse than the 33% predicted by a model which assumed that therapists were able to reject three levels with certainty, but had to guess among the remaining three spinal levels.
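The guessing models referred to above reduce to simple arithmetic: if two independent choices are each made uniformly at random among the same m candidate levels, the probability that they coincide is 1/m. The sketch below tabulates the predicted rates; the function name is hypothetical, and it assumes that the same levels are rejected on both occasions (or by both therapists).

```python
def chance_agreement(levels_presented, levels_rejected=0):
    """Expected agreement between two independent uniform guesses.

    Assumes both choices are drawn from the same set of levels that remain
    after 'levels_rejected' levels have been ruled out with certainty.
    """
    remaining = levels_presented - levels_rejected
    if remaining < 1:
        raise ValueError("at least one level must remain")
    return 1.0 / remaining

# Six levels presented, as in the study described above.
for rejected in (0, 3, 4):
    rate = chance_agreement(6, rejected)
    print(f"reject {rejected} of 6, guess among the rest: {rate:.1%}")
```

The reported agreement rates of 31% (test-retest) and 25.7% (intertherapist) can then be read against the 16.7%, 33.3% and 50% reference points.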
Millman's results therefore suggested that by palpation alone therapists can discriminate better than chance those differences in stiffness derived from anatomical variation. Unfortunately, the degree of agreement, though better than chance, was nevertheless low from the point of view of practical diagnostics. For example, it seems likely that a therapist would be able to narrow the range of clinically relevant levels down to three, or perhaps even two, by using the case history, the other test data and epidemiological knowledge.
However, the generalization of Millman's data faces some problems. First, the source of variation between spinal levels was that due to natural anatomical differences in non-symptomatic spines. In clinical decision making the stated objective is to identify the presence of an abnormality. The frame of reference for the therapist presumably is some cumulated memory model of what is normal (Maitland 1977). Whether the difference between the immediate perceptual trace from an abnormal joint and the cognitive template of normality is an easier discrimination to perform than the discrimination between recently experienced perceptions of stiffness which differ according to anatomical variation between spinal levels, seems to be a moot point in the light of the complexity of the issue and the lack of evidence.
A second problem stems from the choice of 'stiffest level' (Millman 1981) as the object of discrimination. In clinical theory the finding of abnormality may involve a broader base of compliance features. These include 'end feel' and soft tissue resistance as well as postero-anterior ROM (Maitland 1977).
A third problem arises from the nature of the therapist sample. The three therapists all had a minimum of four years clinical experience. However, although they had satisfactorily completed more than half of the postgraduate diploma specializing in manual therapy, including the spinal assessment and treatment portion of the course, it may be that the lack of full qualification and post-specialization experience was a factor in their performance.
Allen (1983) conducted a study which attempted to resolve some of the issues raised by Millman's study. Five lumbar levels from each of twelve patients recruited from several clinics were examined. All patients had a history of back pain. Seven patients had symptoms which had persisted over six months. Three physiotherapists with specialist postgraduate qualifications in manual therapy and a minimum of eighteen months of post-specialization experience performed the assessments. Millman's procedure was replicated, but therapists were asked to select which level had the greatest soft tissue resistance, which had the most abnormal 'end-feel', and which had the smallest postero-anterior amplitude of movement. In addition, therapists were required to indicate which of the five levels should be selected for treatment and which of the three indicators of abnormality mentioned above had most influenced their selection.
Allen's data revealed a very high degree of coherence between the specific indicators of abnormality. On over 97% of occasions two or three of these indicators identified the same level as that selected to be 'most abnormal'. Therefore reliability estimates were prepared only for the decision of which level should be selected for treatment. The test-retest agreement rate averaged 47.2%, somewhat higher than Millman's 31%. However, our analysis of the results obtained by these studies did not indicate the improvement to be statistically significant. The intertherapist agreement rate averaged 26.4% on a pairwise basis in Allen's study. This is very similar to Millman's result and not significantly better than the 20% agreement rate which would be expected from a random guess model. The high coherence between specific indicators seems to imply either that abnormal compliance tends to manifest simultaneously through the several parameters, or that therapists tend to be biased towards 'false alarms' of abnormality having found a single abnormal sign from the level in question. The low degree of reliability suggests that the latter explanation should be preferred. Furthermore, since the test-retest reliability indicates that some degree of consistent information was transmitted even though intertherapist agreement was very low, it seems reasonable to hypothesize that therapists make global judgements of abnormality, on perceptual dimensions which are probably not consistent and which may be difficult to verbalize.
The reliability of compliance ratings following PPIVM tests
Passive movement of a 'physiological' type provides another testing approach which may be used for diagnosis or progress evaluation (Maitland 1977, Cyriax 1982).
Kaltenborn and Lindahl (1969) examined the intertherapist reliability of ten therapists during assessment of intervertebral joint mobility. A four-point rating scale consisting of no movement, hypomobility, normal movement and hypermobility was used. Kaltenborn's ratings were used as a criterion for agreement. Each of the therapists independently gave 13 assessments. Their conclusion of 'remarkably good' agreement was not accompanied by a formal analysis. However, the following results were reported: complete agreement from three therapists; 2 disagreements from two therapists; 3 disagreements from one therapist; and 4 or 5 disagreements from the remaining three therapists. This represents an average agreement rate of about 84%.
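The average agreement rate can be bounded from the counts listed above. The listed groups account for nine therapists, which presumably reflects the tenth (Kaltenborn) serving as the criterion; that reading is an assumption of this sketch, as are the variable names. Because the last group is reported only as '4 or 5' disagreements, a lower and an upper bound are computed.

```python
ASSESSMENTS_PER_THERAPIST = 13

# (number of therapists, disagreements each) as listed in the text.
# The final group is reported as "4 or 5", so both extremes are tried.
GROUPS_FEWEST_DISAGREEMENTS = [(3, 0), (2, 2), (1, 3), (3, 4)]
GROUPS_MOST_DISAGREEMENTS = [(3, 0), (2, 2), (1, 3), (3, 5)]

def average_agreement(groups, per_therapist=ASSESSMENTS_PER_THERAPIST):
    """Proportion of all assessments that agreed with the criterion ratings."""
    therapists = sum(count for count, _ in groups)
    disagreements = sum(count * each for count, each in groups)
    total = therapists * per_therapist
    return (total - disagreements) / total

upper = average_agreement(GROUPS_FEWEST_DISAGREEMENTS)
lower = average_agreement(GROUPS_MOST_DISAGREEMENTS)
print(f"average agreement between {lower:.1%} and {upper:.1%}")
```

On this reading the rate lies between roughly 81% and 84%, depending on how the '4 or 5' disagreements were distributed.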
Gonnella et al (1982) examined the intertherapist and retest reliability of five therapists employing PPIVM tests on lumbar segments. On each of two days, which were separated by a 13 day interval, each therapist independently evaluated the six segments of five young, nonsymptomatic subjects. Two evaluations, one under 'normal' and one under blindfold conditions, were performed within each session. Forward bending, side bending (left and right) and rotation (left and right) were performed. A seven point rating scale was used, with 'ankylosed' and 'unstable' as the end values. In addition 'plus' and 'minus' qualifiers were permitted, which produced a potential 13-point scale. In practice the scale values employed by the observers were limited to the range 1-4, producing an effective seven-point scale biased towards hypomobility. In fact, the distribution was probably even more restricted because the extreme scale values (1.0 and 4.0) seem to have occurred very infrequently (eg 2% for forward bending, the only test for which sufficient data was available to extract a result). Gonnella et al concluded that 'results on intertherapist reliability were disappointing' (p.442). Although this conclusion is not immediately apparent from their analysis of the data, our reanalysis of the evidence Gonnella et al presented (p.440) confirmed their conclusion. For example, with the forward bending manoeuvre we calculate that intertherapist agreement reached 78% when agreement is defined (Gonnella et al 1982) as ratings differing by less than one full scale unit. However, the agreement rate expected from the chance agreement model is 71%. The high degree of chance agreement is the combined effect of a restricted distribution of mobility together with a definition of agreement which accepts a variation of half a scale value (see Appendix).
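The way a restricted distribution and a tolerant definition of agreement combine to raise chance agreement can be illustrated with a short sketch. The rating distribution below is hypothetical (the published frequencies are not reproduced here), so the printed figures show the mechanism only and are not a reconstruction of the 71% estimate.

```python
def chance_agreement_within_tolerance(distribution, tolerance=1.0):
    """Probability that two independent ratings differ by less than 'tolerance'.

    'distribution' maps each scale value to its relative frequency; both raters
    are assumed to draw independently from the same distribution.
    """
    return sum(
        p_a * p_b
        for value_a, p_a in distribution.items()
        for value_b, p_b in distribution.items()
        if abs(value_a - value_b) < tolerance
    )

SCALE = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

# Ratings spread evenly over the seven half-unit scale values.
uniform = {value: 1 / len(SCALE) for value in SCALE}

# Hypothetical restricted distribution: most ratings near the middle of the
# range, with the extremes (1.0 and 4.0) together accounting for 2%.
restricted = dict(zip(SCALE, [0.01, 0.09, 0.25, 0.35, 0.20, 0.09, 0.01]))

print(f"uniform:    {chance_agreement_within_tolerance(uniform):.1%}")
print(f"restricted: {chance_agreement_within_tolerance(restricted):.1%}")
```

Concentrating the ratings on a few neighbouring values raises the rate at which two raters 'agree' purely by chance, which is why a raw agreement figure needs to be read against its chance baseline.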
Thus the PPIVM research literature presented until 1982 a somewhat equivocal overview. One study claimed good results for intertherapist reliability (Kaltenborn and Lindahl 1969), while another found poor results (Gonnella et al 1982). A further problem was that of non-generalizability of findings, either because the evaluation samples included few subjects (Kaltenborn and Lindahl 1969) or nonsymptomatic subjects (Gonnella et al 1982).
Therefore, Clarkson (1982) investigated the intertherapist reliability of four experienced physiotherapists specialized in manual therapy. The test sample comprised ten subjects aged 20-55, all of whom had a history of low back pain. One subject had a radiographically confirmed sacralization of L5. Others included a retired dancer, a footballer and a champion runner. That is, there was an effort to obtain a wide cross-section of test joints. Each therapist independently assessed each vertebral segment from S1 to T12 using the PPIVM technique for forward flexion described by Maitland (1977). Therapists used a five-point scale with the end values being 'ankylosed' and 'hypermobile'. On the average, the pairwise intertherapist agreement rate was 45%. Statistically this was significantly better than the 37% expected to occur from chance agreement. However, from a clinical point of view it does not seem a very encouraging result. When the 'stiff' and 'very stiff' ratings were amalgamated to produce a four-point scale like that of Kaltenborn and Lindahl (1969) the agreement rate became 57%. This seems substantially lower than the 82% obtained by Kaltenborn and Lindahl. The results are also poorer than the 78% agreement rate obtained by Gonnella et al (1982), although the comparison is complicated by differences in the rating scales used.
Further evidence about the reliability of PPIVM is available outside the research literature of physiotherapy. Rotational manoeuvres similar to the techniques employed by physiotherapists are encountered in osteopathy (Johnston 1982). Recently, Johnston et al (1982a) reported on the intertherapist reliability obtained by one osteopathic physician and two student physicians. The tests employed were cervical rotation, cervical sidebending and several trunk motions. The experimental sample comprised 161 volunteers which included 84 students and 71 patients. However the report does not clarify the particular characteristics of the subsamples used to assess the reliability of the different motions. Therapists were required to indicate if resistance to passive motion was symmetrical or asymmetrical for left and right manoeuvres. For cervical rotation the three therapists agreed on 42% of the 43 subjects tested this way. For cervical sidebending they agreed on 33% of 36 subjects. Although these agreement rates appear rather low they were significantly higher than those expected to occur by chance (19% and 14%, respectively). Furthermore, these are three-way agreements rather than pairwise agreement rates as in the other studies reviewed by this section. Unfortunately the report by Johnston et al (1982a) makes extraction of mean pairwise agreement difficult, thereby precluding direct comparisons. In addition the ratings required were somewhat different. Nevertheless, in terms of clinical significance, the results seem rather disappointing, a conclusion shared by Johnston et al (1982a).
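The distinction between three-way and pairwise agreement matters because unanimous agreement among three raters is harder to reach by chance than agreement between two. A minimal sketch, assuming independent raters who share the same category base rates, is given below; the base rates are hypothetical and are not those of the Johnston et al sample.

```python
def chance_unanimous_agreement(base_rates, raters):
    """Chance that 'raters' independent raters all pick the same category.

    Assumes every rater chooses category i with the same probability
    base_rates[i], independently of the others.
    """
    if abs(sum(base_rates) - 1.0) > 1e-9:
        raise ValueError("base rates must sum to 1")
    return sum(p ** raters for p in base_rates)

# Hypothetical base rates for three categories
# (symmetrical, asymmetrical to the left, asymmetrical to the right).
base_rates = [0.5, 0.3, 0.2]

pairwise = chance_unanimous_agreement(base_rates, raters=2)
three_way = chance_unanimous_agreement(base_rates, raters=3)
print(f"pairwise chance agreement:  {pairwise:.0%}")
print(f"three-way chance agreement: {three_way:.0%}")
```

With these illustrative base rates the chance level drops from 38% for pairs to 16% for three-way agreement, so a three-way rate cannot be compared directly with pairwise rates from other studies.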
In a subsequent study on cervical rotation, Johnston et al (1982b) evaluated intertherapist reliability when only subjects with strong indications of asymmetry were included in the sample. Preselection of subjects was based on agreed examination findings by two faculty osteopaths. Three student therapists then independently examined the subjects. The pairwise agreements for each student with the faculty examiners were 71%, 62% and 57%. While the agreement rate from the first student was significantly higher than expected to occur by chance, this was not the case for the other two sets of ratings. Given the preselected sample of subjects and the restriction of ratings to symmetry or left and right asymmetry, the 63% average agreement rate is disappointing in terms of clinical significance, particularly since cervical rotation seemed the most promising test in the prior study (Johnston et al 1982a).
It may be tempting to dismiss Johnston's low reliability as resulting from therapist inexperience, but Kaltenborn and Lindahl's (1969) 84% agreement rate was based on a group which included a variety of experience. In any case, poor results were also found in studies with experienced therapists (Clarkson 1982, Gonnella et al 1982).
Interpretation of the studies examining PPIVM test reliability is complicated further by the variety of rating scales, subjects and spinal levels used. Furthermore, agreement rates are difficult to compare directly because they are influenced by distributional properties including response base rates, which may vary across studies.
To facilitate comparisons of agreement rates we therefore expressed the results of the above studies in terms of Cohen's kappa (see Appendix). Since kappa expresses the proportion of obtained agreements relative to that expected to occur by chance, it facilitates comparisons across studies which employ different rating scales, test joints, or other methodological features which might alter the statistical properties of the therapists' responses. A second advantage is that it is a correlation-like index, which varies between zero and one (unless observed agreements are less than expected by chance). Using data presented in the published reports, we found kappas of 0.64 for Kaltenborn and Lindahl (1969), 0.37 for Johnston et al (1982b), 0.24 for Gonnella et al (1982) and 0.15 for Clarkson (1982). In general therefore the studies of PPIVM tests do not seem to yield a very good degree of intertherapist reliability, particularly within the framework of clinical requirements. In view of the variety of therapist backgrounds, subjects used (including symptomatic and nonsymptomatic) and other variables, this conclusion probably has good generalizability and concurs with the more recent of previous interpretations (Gonnella et al 1982; Johnston et al 1982).
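For readers who wish to compute kappa from raw ratings, a minimal sketch follows. It uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), with the chance term p_e taken from the two raters' marginal frequencies. The ratings are hypothetical and the function name is illustrative.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same items.

    p_observed is the proportion of items rated identically; p_chance is the
    agreement expected if the raters were independent with their own marginal
    frequencies. Undefined (division by zero) when p_chance equals 1.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical mobility ratings of ten joints by two therapists.
therapist_1 = ["stiff", "normal", "normal", "stiff", "normal",
               "normal", "hypermobile", "normal", "stiff", "normal"]
therapist_2 = ["stiff", "normal", "stiff", "normal", "normal",
               "normal", "normal", "normal", "stiff", "normal"]

kappa = cohens_kappa(therapist_1, therapist_2)
print(f"kappa = {kappa:.2f}")
```

In this invented example the raters agree on 70% of joints, but because 'normal' dominates both sets of ratings the chance level is 51% and kappa is only about 0.39.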
Reliability of spinal mobility assessment using combined PAIVM and PPIVM tests
In addition to the investigations cited in the previous three subsections which have involved either PPIVM or PAIVM assessment, a number of studies reported in the literature have used combined assessment techniques to rate spinal mobility. Because of the combined nature of the assessment task utilized in these studies, it is not possible to separate the individual reliability of any one of the tests involved. However the studies outlined below provide some insights into therapist performance.
Jull (1978) reported a study which examined the intertherapist reliability of rating the mobility of the upper three cervical joints following PAIVM and PPIVM tests. Each therapist performed 81 tests ranking each joint on a five point scale with the extremes of 'hypermobile' and 'no movement'. A total agreement rate of 88% was claimed, which is highly encouraging. However, a number of methodological issues suggest that this agreement rate should be interpreted with caution. Given the relative infrequency of 'hypermobility' and 'no movement' ratings likely to occur in the population, the effective range of variability may have been somewhat reduced. Unfortunately, no data on the relative frequency of findings in each category were reported. Furthermore, several decisions came from a given spinal segment. This could have introduced further restrictions in the (a priori) subjective range of potential variation. Finally the generalizability of the data is limited by the fact that the smallest sample viable for an intertherapist reliability study was used: two therapists.
In a later report, Jull (1982) provided further evidence of intertherapist reliability for combined PPIVM and PAIVM tests of lumbar segments. Two therapists examined one subject on three successive occasions. The intersession interval was one day. The intertherapist reliability coefficient was 0.35, which has been interpreted to mean that 'examiners correlated highly' (Jull 1982, p.75). Although the result was significantly different from no correlation, in the statistical confidence sense, a reliability coefficient of 0.35 is not high. In fact, the majority of the variance in the observed scores is attributable to error when the coefficient is so low. A similar argument applies to the intersession reliability coefficient reported to be only 0.10.
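The remark about error variance follows from the variance-ratio definition of reliability: if the coefficient is the share of observed variance due to true scores, the remainder is error. The short sketch below makes the arithmetic explicit, assuming the reported coefficients can be read directly as reliability estimates in that classical sense.

```python
def error_variance_share(reliability_coefficient):
    """Share of observed-score variance attributable to error.

    Assumes the coefficient estimates true-score variance / total variance.
    """
    if not 0.0 <= reliability_coefficient <= 1.0:
        raise ValueError("coefficient must lie between 0 and 1")
    return 1.0 - reliability_coefficient

for label, coefficient in (("intertherapist", 0.35), ("intersession", 0.10)):
    share = error_variance_share(coefficient)
    print(f"{label}: r = {coefficient:.2f}, error share = {share:.0%}")
```

On this reading, about 65% of the observed variance is error for the intertherapist coefficient and about 90% for the intersession coefficient.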
In a further study, Jull and Lane (1983) published findings related to assessment of lumbar spinal mobility. A subsample of 20 normal subjects from a population of 100 males and 100 females with no history of back pain were examined. Postero-anterior accessory glide and all passive physiological movements were assessed in six intersegmental levels from T12/L1 to L5/S1. Each level was classified on a five point rating scale from 'hypermobile' to 'very stiff'. The retest agreement rate for the single participating therapist was 87.3%. Intertherapist agreement on a subsample of five subjects was reported to be 82.2% between the therapist and an independent observer. Once again, these high agreement rates should be interpreted with caution because limited sample variability will increase the agreement attainable by chance. On the basis of the averaged data published by Jull and Lane (1983) for their full population, we estimate that an agreement rate of 38% could have been typically expected to occur by chance. If the test subsample had consisted of only the younger subjects the chance agreement estimate would have been 61%. Using the chance agreement rate for the whole population we computed a Cohen's kappa for the retest agreements of 0.79 and for the intertherapist agreements of 0.71. Using the 61% estimate, kappa values would have been 0.67 and 0.54 respectively.
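The kappa values quoted above can be checked from the reported agreement and chance rates using the standard formula, kappa = (observed - chance) / (1 - chance). The sketch below is a worked check under that formula only; the function name is illustrative and the rates are those given in the text.

```python
def kappa_from_rates(observed_agreement, chance_agreement):
    """Cohen's kappa from an observed agreement rate and a chance agreement rate."""
    return (observed_agreement - chance_agreement) / (1.0 - chance_agreement)

OBSERVED = {"retest": 0.873, "intertherapist": 0.822}
CHANCE = {"whole population": 0.38, "younger subjects only": 0.61}

results = {}
for chance_label, chance_rate in CHANCE.items():
    for label, observed_rate in OBSERVED.items():
        results[(chance_label, label)] = kappa_from_rates(observed_rate, chance_rate)
        print(f"{chance_label}, {label}: kappa = {results[(chance_label, label)]:.3f}")
```

The computed values agree with the quoted kappas of 0.79, 0.71, 0.67 and 0.54 to within rounding, and show how strongly the result depends on the chance estimate adopted.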
Grant (1980) examined lumbar spinal mobility in groups of dancers and non-dancer controls using a number of techniques including passive movement tests. Within the study, two observers performed twenty tests on five subjects, rating lumbar levels on a four point scale from 'hypermobile' to 'very stiff', and an interobserver agreement rate of 90% was obtained. The actual frequency distribution of test findings was not included in the report nor was it indicated from which of the experimental groups the subjects were drawn. Therefore we did not proceed to estimate kappa, the more appropriate coefficient.
It should be pointed out that it was not the primary intention of Jull (1978, 1982), Jull and Lane (1983) or Grant (1980) to measure reliability of assessment per se, but only to determine the reliability of the therapists who performed assessments for the various studies. Consequently, the generalizability of these results is in all cases limited by the fact that absolute minimum numbers of therapists were involved in both retest and intertherapist trials. Furthermore it is not clear whether the judgements resulting from the several segments sampled from a given subject were statistically independent. Lack of independence could have artificially raised the estimate of reliability.
Jull and Bogduk (1985) examined the reliability of diagnosis of zygapophyseal joint disorders in a group of twenty patients attending a pain clinic because of cervical pain. A trained therapist stipulated the abnormal cervical level after a full subjective and objective examination, including passive physiological and accessory movements. To provide an objective criterion, medial branch blocks (Bogduk 1985) were used to selectively anaesthetize nerves supplying cervical joints. Perfect agreement between the diagnosis of the therapist and the medial branch block was obtained. A subsample of four subjects was independently examined by another manipulative therapist, with perfect agreement on the abnormal joint. The results of Jull and Bogduk (1985) might suggest that palpatory tests can perfectly diagnose the level to be treated. It should be noted however that the patient sample had severe pain, which was often irritable (Jull and Bogduk 1985, p.163) and that the manual assessment not only included pain reproduction, but also was conducted in the context of other information produced by a full objective and subjective examination. Although the authors claim that the pathological joints had such abnormal compliance features as 'limited range of motion', 'abnormal quality of resistance' and 'abnormal limitation to the movement' (Jull and Bogduk 1985, p.164), they also report that 'reproduction of pain was invariably associated with these abnormal qualities of movement' (p.164). On the basis of our experience with assessment of compliance features (low reliability) and pain (high reliability) an alternative hypothesis is indicated: that provocation and reproduction of pain was the key factor in reliable identification of the injured level. This interpretation seems preferable because it is more parsimonious, being consistent with both our group's results and those of Jull and Bogduk (1985).
IV Reliability in the Production of Therapeutic Passive Movement
The reliability with which therapeutic movement is produced has received no systematic investigation according to our reviews of the journal literature. The degree of intratherapist or intertherapist variation in production of passive movement is presumably an important factor, at least theoretically, since some descriptions of mobilization techniques do identify various grades and do recommend selective use according to various conditions, eg Maitland (1977). Until systematic empirical studies are conducted to assess the differences in therapeutic outcome due to different grades of mobilization, the actual importance of using selected grades of mobilization, or of the reliability with which they are produced, must remain a problem which is justified only theoretically or through clinical anecdote. Nevertheless, given the broad influence on clinical and educational practice which descriptions of grades of mobilization have attained, the issue seems to require far greater attention than it has received to date.
However, the primary purpose for reporting here two pioneering studies (Banting 1982, Mitchell 1983) conducted in our laboratories on this issue is that the reliability with which selected grades of movement are produced is indirectly related to the reliability with which compliance is assessed. That grades of mobilization are related to assessment of compliance is clear from descriptions of clinical procedures (Maitland 1977). The link was even more explicit in the definitions used by Banting (1982) and Mitchell (1983) when instructing the therapists in their studies. Grade II mobilizations were defined as 'large amplitude movements to the point where R1 is just perceived, at a rate of two to three oscillations per second' (Banting 1982, Mitchell 1983). Grade IV mobilizations were defined as 'a small amplitude movement just up to and touching the end of available joint range' (Mitchell 1983). Again two to three oscillations per second was the recommended oscillation frequency.
To investigate reliability, both studies adopted the strategy of presenting several spinal levels from several individuals, thus ensuring a variety of ranges and joint mobilities. The reproducibility of peak force of mobilization can then be examined within the frame of reference provided by the variations due to anatomical and individual differences. When the same levels from the same subjects are examined, the intertherapist and retest correlations for peak force are then akin to the reliability coefficients for locating R1 and R2 in range presented by other studies (Baker 1981, Wong 1981, Weeks 1982, Flint 1983), particularly given the explicit definitions used by Banting and Mitchell.
The Australian Journal of Physiotherapy, Vol. 31, No. 5, 1985
In both studies the force platform technique already described was used to assess the forces of mobilization while therapists performed central PAIVM. The output of the force platform was monitored by computer. This permitted calculation of peak force of mobilization for each oscillation, as well as of oscillation amplitude and frequency, by means of the dynamic force measurement technique described earlier. The data on the latter two parameters are important to the wider issue of reproducibility of technique but are less directly relevant to the present theme. They are considered in detail elsewhere (Banting et al 1985).
Banting (1982) examined intertherapist reliability in seven physiotherapists with specialist postgraduate qualifications in manual therapy. The least experienced therapist had more than nine months of clinical practice since completion of the specialist qualification. The sample comprised graduates from schools in three different Australian States. Each therapist mobilized four premarked spinal levels (T11, T9, T7, T5) from each of four subjects using central PAIVM delivered with the pisiform technique. Each level was mobilized for 20 seconds. Among other parameters, the peak forces during a cycle were calculated and averaged for all the cycles of a trial. Scores from the 16 levels mobilized by each therapist were then used to compute pairwise intertherapist correlations. The mean intertherapist correlation was a very poor 0.22. In addition, systematic biases were found between the seven therapists when the peak forces were averaged across the 16 spinal levels (Banting 1982). Two therapists showed a 'light touch' (7.6N and 9.8N), three were two to three times more forceful (14.5N, 16.3N, 20.6N) and two showed nine or more times that force (50.2N, 87.1N). An analysis of variance confirmed these differences to be statistically significant (Banting 1982).
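The pairwise procedure described above can be sketched as follows. This is a minimal illustration, assuming one row of trial-averaged peak forces per therapist; the function names and the demonstration values are invented for the example and are not data from Banting (1982).

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_r(scores_by_therapist):
    """Average r over every pair of therapists (one row of scores per therapist)."""
    pairs = list(combinations(scores_by_therapist, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


# Hypothetical peak forces (N) over four levels: the second therapist is
# simply 40 N more forceful than the first at every level.
demo = [
    [10.0, 12.0, 9.0, 14.0],
    [50.0, 52.0, 49.0, 54.0],
]
print(round(mean_pairwise_r(demo), 2))
```

In this invented case the correlation is perfect even though one therapist is consistently 40 N heavier, because a correlation coefficient is insensitive to a constant offset. That is presumably why systematic bias between therapists is examined separately from the correlations.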
Mitchell (1983) replicated and extended Banting's study. Subjects were eight experienced physiotherapists with specialist postgraduate qualifications in manual therapy. Each mobilized twenty spinal levels comprising T9, T11, L1, L3 and L5 from one female and three male volunteers with no history of back pain. The same twenty segments were mobilized again one week later. Thus the design assessed both intertherapist and test-retest reliabilities for Grade II and Grade IV movements. In order to maintain comparability all joints were pre-mobilized by the experimenter. Thus all therapists, including the starter, were dealing with previously mobilized spines.
Among other parameters, Mitchell (1983) calculated the peak force for each oscillation. Following the earlier study (Banting 1982), trial averages were computed, from which intertherapist and retest correlations were obtained. Mitchell confirmed that intertherapist reliability for Grade II movements was low (r = 0.25) and showed that this was also the case for Grade IV (r = 0.16). In addition he found poor test-retest reliability for both Grade II (r = 0.22) and Grade IV (r = 0.42).
Systematic biases were also evident in the data. The peak forces for Grade II when averaged over the twenty segments showed an intertherapist range from 2.2N to 46.7N on Day 1. Even the trimmed range, excluding the extreme therapists, was 13.0N to 30.2N. On Day 2 the range was 3.9N to 26.4N. Analysis of variance confirmed that there were significant differences between therapists and between days (Mitchell 1983). Similarly, for Grade IV, the intertherapist range was 150.9N to 329.3N on Day 1 and 89.2N to 222.4N on Day 2. Again analyses of variance confirmed that there were statistically significant differences between therapists and between days (Mitchell 1983).
The studies of Banting and Mitchell relate to those for R1 assessment in the case of Grade II movement peak forces and to those of R2 in the case of Grade IV peak forces. The findings show very good consistency. Thus for intertherapist reliability in locating R1 the comparison figures are 0.30 (Wong 1981), 0.25 (Weeks 1982) and 0.38 (Flint 1983). These confirm the Grade II results (r = 0.22, r = 0.25). The comparison figures for intertherapist reliability in locating R2 are 0.28 (Baker 1981) and 0.24 (Flint 1983), which seem to support Mitchell's Grade IV result (r = 0.16). The poor test-retest correlation obtained by Mitchell for Grade II (r = 0.22) is, if anything, better than the low value obtained by Weeks over the same interval (r = 0.09). Therefore the results obtained by Mitchell and Banting reinforce the conclusion of poor reliability for estimation of spinal compliance during PAIVM.
V Discussion
An overview of the studies presented above suggests several patterns in the findings (cf also Table 1). In general, pain tests were more reliable than tests assessing features of compliance. This effect was obtained even when very similar testing techniques were used, such as when PAIVM was used for both P1 and R1 assessment. A second feature of the results is the excellent reliability obtained with the SLR and FF tests for pain. The correlation coefficients (0.96-0.98) were superior to those obtained by P1 assessment with PAIVM (0.73). These differences are statistically significant. A third aspect is the consistent finding of superior test-retest reliability over intertherapist reliability. This is a common result in most fields of measurement. Before the clinical implications of these findings are considered it is appropriate to discuss some factors which may account for the obtained results.
Table 1: Summary of reliability coefficients
Measure  Test movement  Retest r(K)  Inter-observer r(K)  CI (95%)  Source
ROM
  PAIVM  .88  Collis-Brown
  PAIVM  .86  Grisold
  PAIVM  .78  McNeill
R1
  PAIVM  .30  Wong
  PAIVM  .25  Weeks
  PAIVM  .38  Flint
  PAIVM  .46  17%  Weeks
  PAIVM  .09
  PAIVM  .22  Banting
  PAIVM  .22  .25  Mitchell
R2
  PAIVM  .28  Baker
  PAIVM  .42  .16  Mitchell
P1
  PAIVM  .73  34%  Collis-Brown
  PAIVM  .83  Collis-Brown
  PAIVM  .62  McNeill
  PAIVM  .75  McNeill
  SLR  .96  13.60°  McFarlane
  SLR  .98  Puentedura
  SLR  .97  Million et al
  SLR  .96-.97  .93-.96  Lankhorst et al
  SLR  .78  Hoehler
  SLR  .95  Hoehler
  FF  .98  83mm  Kwong
  FF  .99  Munroe
  FF  .91  Million et al
  FF  .95  .97  Lankhorst et al
  FF  .50  Hoehler
Comments
  after correction for error in patient report.
  after correction for error in patient report.
  intrasession
  intersession
  peak applied force during Grade II
  peak applied force during Grade II
  peak applied force during Grade IV
  therapist location on VAS
  measured force at patient report of P1
  therapist location on VAS
  measured force at patient report of P2
  intersession
  passive test
  active test
  skin distraction
  intersession, skin distraction
  skin distraction
Factors which may account for the superior reliability of pain assessment
Although pain tests showed better reliability than tests of compliance features, there are procedural differences between the pain tests investigated. Therefore the comparison between PAIVM assessment of pain and compliance features is probably the most appropriate for discussion.
In order to locate P1 by PAIVM a physical stimulus is applied. The patient must sense and report pain onset, and the therapist must then relate that event to a point in ROM. In order to locate R1 a similar physical stimulus is applied, the therapist must sense the occurrence of the 'onset of resistance', then relate that event to a point in ROM. For both tests some of the total error will be due to stimulus application and some to the ability to locate a point in ROM. Thus the essential difference between the two judgemental processes is that tests of pain involve only one judgement, that of ROM, while tests of compliance require the judgement of both the compliance feature and ROM. It may appear therefore that the issue is simply a question of which of these contrasting perceptual processes contains more error. However, the quantitative theory of reliability shows clearly that reliability is a function of both error and true score variation (see Appendix). The same amount of error (in metric terms) means poorer reliability if the true score variation is small rather than large.
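The dependence of reliability on true score variation can be illustrated with a short sketch. It assumes the usual classical test theory definition, in which reliability is the ratio of true score variance to total (true plus error) variance; the numerical values are illustrative only and are not taken from any of the studies reviewed.

```python
def reliability(true_sd, error_sd):
    """Classical test theory: true-score variance over total observed variance."""
    true_var = true_sd ** 2
    return true_var / (true_var + error_sd ** 2)


# Illustrative values only (percent of scale), not data from any study.
error_sd = 6.0
for true_sd in (6.0, 12.0, 24.0):
    # Same error each time; only the spread of true scores changes.
    print(true_sd, round(reliability(true_sd, error_sd), 2))
```

With the error held constant, the coefficient rises from 0.5 to about 0.94 as the spread of true scores widens, which is the point made in the paragraph above.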
It is important to note that the low correlations obtained for R1 and R2 are at least in part due to the restricted range of true score variability. R1 tends to be restricted to the lower third of range while R2 tends to be restricted
Table 1 (continued): Summary of reliability coefficients
Measure  Test movement  Retest r(K)  Inter-observer r(K)  CI (95%)  Source  Comments
P2
  SLR  .96  Puentedura
  FF  .98  33mm  Bruce
  FF  .97  52mm  Patterson  intersession
  FF  .98  45mm  Patterson  intrasession
Compliance
  PPIVM  (.64)  Kaltenborn
  PPIVM  (.37)  Johnston (1982b)
  PPIVM  (.24)  Gonnella et al
  PPIVM  (.15)  Clarkson
  Mixed  .35  .10  Jull (1982)  combined PPIVM and PAIVM
  Mixed  (.67-.79)  (.54-.71)  Jull and Lane  combined PPIVM and PAIVM
Level Selection
  PAIVM  (.16)  (.11)  Millman  stiffest level
  PAIVM  (.34)  (.08)  Allen  level to be treated
  (1.00)  Jull and Bogduk  pathological level; full objective and subjective examination
to the upper third. We re-examined the data of Wong (1981), Weeks (1982) and Flint (1983) to confirm this tendency. The standard deviation of R1 in ROM occupied respectively 8%, 8.3% and 7.9% of scale in the three studies, confirming the restricted variability of R1 in both normal and clinical populations and indicating very good consistency between the three independent studies. In contrast, the standard deviation of P1 was 23.2% of ROM in the Collis-Brown (1982) study. We have applied equation A.21 (see Appendix) to compute what the obtained correlations would have been had the true score variability been the same as that observed by Collis-Brown for P1. The intrasession test-retest correlation of Weeks (0.46) becomes 0.83; the intersession correlation (0.09) becomes 0.25 and the intertherapist correlation obtained by Flint (0.38) becomes 0.76. It seems possible therefore to account for the poorer reliability of compliance feature assessment without suggesting that therapists perceive R1 or R2 more poorly than patients perceive P1 or P2. It seems rather that therapists face a more difficult discrimination problem when attempting to locate R1.
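Equation A.21 is not reproduced in this section, so the following is only a sketch of one standard correction for restriction of range (often attributed to Pearson and to Thorndike's 'Case 2'). Treating the reported standard deviations as the restricted and unrestricted values is an assumption, and the function name is hypothetical. Under that assumption the correction gives roughly 0.82, 0.24 and 0.77 for the three coefficients, close to the 0.83, 0.25 and 0.76 quoted above; the small differences may reflect rounding or a different formula in the Appendix.

```python
from math import sqrt


def correct_for_range_restriction(r, restricted_sd, unrestricted_sd):
    """Estimate the correlation expected if the SD were unrestricted_sd.

    Standard range-restriction correction; assumed here, not confirmed
    to be the paper's equation A.21.
    """
    k = unrestricted_sd / restricted_sd
    return r * k / sqrt(1 - r ** 2 + (r ** 2) * (k ** 2))


P1_SD = 23.2  # Collis-Brown (1982), percent of ROM, as quoted in the text
cases = [
    ("Weeks intrasession", 0.46, 8.3),
    ("Weeks intersession", 0.09, 8.3),
    ("Flint intertherapist", 0.38, 7.9),
]
for label, r, r1_sd in cases:
    print(label, round(correct_for_range_restriction(r, r1_sd, P1_SD), 2))
```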
Factors which may account for the inferior reliability of passive intervertebral movement tests of pain
Passive intervertebral tests, whether 'accessory' or 'physiological', invariably yielded poorer reliability coefficients than those of the gross movement tests such as SLR and FF. To avoid the confounding contribution of pain versus compliance assessment, an appropriate comparison available for discussion is between FF or SLR tests of pain versus PAIVM assessment of pain.
In FF or SLR tests a gross 'physiological' movement provides the stimulus for pain elicitation, the patient must then perceive and report pain onset (or similar parameters) and ROM can be recorded via goniometry or measures of relatively large linear displacements. In PAIVM tests a more localized movement is the stimulus for pain elicitation, the patient must perceive and report pain onset (or similar parameters), then the therapist must, through subjective evaluation of ROM, record where the pain occurred.
The issues for discussion therefore seem to be: the reliability of subjective ROM assessment by the therapist versus goniometric or similar methods for ROM assessment; and the reliability of pain elicitation by gross physiological movement versus localized PAIVM.
As might be expected, goniometric assessment is typically reported to show high reliability (Leighton 1955, Myers 1961, Boone et al 1978, Ekstrand et al 1982). However, the reliability of assessing ROM by palpation does not seem to have been previously investigated. Initial evidence that therapists do not introduce a very large amount of error at the stage of locating the P1 report in ROM was obtained by Collis-Brown (1982). His test-retest correlation when based upon force platform data, which does not involve therapist judgement of ROM, was 0.83. When based upon therapist determined data it was 0.73. Thus adding subjective
ROM assessment to the total process did not reduce reliability substantially. The degree of error due to patient report and force measurement technique is represented in the force test-retest correlation (0.83). It is possible to calculate what the test-retest reliability would have been if no error had arisen from these processes (see Appendix). This indirect estimate of test-retest reliability for locating a point in ROM was 0.88.
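The Appendix formula is not shown in this section. One simple reading that reproduces the reported figure is to assume that the overall coefficient is the product of the reliabilities of its component processes, so that the component of interest is obtained by division. The sketch below works under that assumption only; the function name is hypothetical.

```python
def component_reliability(overall_r, other_component_r):
    """Reliability of one stage, assuming overall_r is the product of both stages."""
    return overall_r / other_component_r


# 0.73: therapist-determined P1; 0.83: force at patient report (both from the text).
rom_location_r = component_reliability(0.73, 0.83)
print(round(rom_location_r, 2))  # 0.88
```

The result agrees with the 0.88 quoted above, although the agreement does not establish that this is the calculation the authors actually performed.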
Additional evidence for high intratherapist reliability of ROM assessment was obtained by Grisold (1983). Therapists were asked to palpate end of range of a single lumbar level using the pisiform technique. They were then asked to palpate one, two, three, four, five, six and seven eighths of range in a random order prescribed by the experimenter. This procedure was repeated eight times, varying the order of presentation of point in range each time.
The static force platform technique (see Section I) was used to measure applied force for each of the 64 trials. Average test-retest correlations were 0.86, almost identical to the 0.88 computed from the data of Collis-Brown after correction for variation in patient report.
Nevertheless, although the ability of therapists to locate a point in ROM seems relatively high, particularly in consideration of the difficulty of the task, it is lower than that of goniometric and related techniques (0.96-0.98), thus accounting in part for the lower reliability of P1 assessment through PAIVM. That the assessment of ROM cannot be the full explanation for the superior reliability of the SLR and FF tests is clear from Collis-Brown's (0.83) retest correlations for applied force at P1. This coefficient is analogous to those derived from goniometric measurement during SLR tests, or length measurement during FF tests. Our statistical analysis revealed that 0.83 was significantly lower than either 0.98 or 0.96. Thus some of the superiority in reliability exhibited by SLR and FF tests appears attributable to the second factor, ie the way pain is elicited.
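The text does not say which test was used to compare the coefficients. A common choice for two correlations from independent samples is Fisher's z transformation, sketched below. The sample sizes in the example are hypothetical placeholders, since the sizes of the original samples are not given in this section.

```python
from math import atanh, erfc, sqrt


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher z test for two correlations from independent samples.

    Returns the z statistic and a two-sided p value.
    """
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p


# n = 20 per sample is a hypothetical value chosen only for illustration.
z, p = compare_independent_correlations(0.83, 20, 0.98, 20)
print(round(z, 2), round(p, 4))
```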
Manual application of accessory movement seems to be more susceptible to random error than the application of physiological movement. Our evidence suggests that production of PAIVM is likely to contain significant error in comparison to the limited distribution of R1 and R2 over the ROM (Baker 1981, Wong 1981, Weeks 1982, Flint 1983). Biomechanical studies confirm the difficulty facing the therapists. Punjabe et al (1977) have measured 4mm displacement between lumbar vertebral bodies when forces of about 160N were applied in the anterior direction to the cephalad vertebra in vitro. Collis-Brown (1982) and McNeill (1982) measured maximum forces applied during PAIVM tests of about 350N. It is reasonable to assume that this load is equally distributed between the intervertebral joints on either side of the assessed level. This implies that similar loads (350/2 = 175N) were applied by the therapist to lumbar intervertebral joints during PAIVM as were applied in the in vitro studies of Punjabe et al (1977). Similar intervertebral displacement would therefore be expected in the two cases. The in vitro observations of Punjabe et al have been tentatively confirmed in vivo by Thompson (1983), who developed an apparatus for measuring applied load and relative intervertebral displacement simultaneously. The apparatus consisted of a proof-ring strain gauge through which force was applied centrally to a lumbar vertebra (L3). Two parallel linear-displacement transducers attached to the strain gauge and adjusted to contact the spinous processes of vertebrae immediately above and below the loaded process were used to measure relative displacement between L2 and L3 and between L3 and L4. Results for three subjects indicated that the caudad joint exhibited more displacement (3-5mm) than the cephalad joint (1-3mm) with applied loads of 250N. Again, if the assumption is made that this load is distributed equally between the intervertebral joints above and below, this represented a force of 125N at each joint. These data suggest that therapists are required to produce very small variations in displacement, sometimes by the application of large forces. Both factors seem conducive to poor performance.
In contrast to the difficulties presented to reliable stimulus production during PAIVM, the procedures of FF and SLR tests seem to be taking advantage of a naturally available system for amplification of joint movements. Anatomical evidence indicates that relatively gross physiological movements will produce very small intervertebral movements. For example, during forward flexion of the trunk, approximately the first 60° is accomplished by spinal structures alone. Farfan (1973) and Allbrook (1957) have shown that approximately 12° of this total is contributed by the L5-S1 joint and a further 12° by the L4-L5 joint. The remaining lumbar joints contribute about 7° each, with the remainder distributed over the relatively immobile thoracic vertebrae. It is a commonly held view that for trunk flexion angles less than 60°, the lumbar joints contribute to the total in an amount proportional to their contribution to maximal flexion (although we have been unable to find quantitative evidence which relates to this point). According to this model, as the trunk moves through 5°, the lower lumbar vertebral joints move through an angle of 1° and the higher joints through about 0.5°. At the same time, the shoulders, a distance of 0.5m away from the lumbar vertebral joints, move through an arc length of about 4cm, by comparison with the fractions of millimeters displacement at the joints themselves. The amplification effect is quite clear. Similar arguments pertain to structures affected by the SLR.
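The arithmetic behind the amplification argument can be sketched as follows, assuming the proportional-contribution model and the figures quoted above (60° of spinal flexion, 12° and 7° per joint, shoulders 0.5m from the lumbar joints). The function names are hypothetical.

```python
from math import radians


def joint_share_deg(trunk_angle_deg, joint_max_deg, total_max_deg=60.0):
    """Joint rotation if each joint contributes in proportion to its maximum."""
    return trunk_angle_deg * joint_max_deg / total_max_deg


def arc_length_cm(radius_m, angle_deg):
    """Arc travelled by a point radius_m from the pivot, in centimetres."""
    return radius_m * radians(angle_deg) * 100.0


print(joint_share_deg(5, 12))           # lower lumbar joint: 1.0 degree
print(round(joint_share_deg(5, 7), 2))  # higher lumbar joint: 0.58 degree
print(round(arc_length_cm(0.5, 5), 1))  # shoulder displacement: 4.4 cm
```

A 5° trunk movement thus corresponds to about 1° at a lower lumbar joint but several centimetres at the shoulders, consistent with the 'about 4cm' quoted in the text.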
The implication is that effects well within the control of the therapist's motor skill (or in the case of active movement tests within the patient's motor skill) would produce quite small changes at the spine, thus improving
the signal to noise ratio of the manoeuvre.
Further possible problems with passive intervertebral movement tests
The argument has already been put that although therapists' ability to locate a point in ROM is reasonably good (r = 0.86-0.88), the narrow range of variation in compliance parameters places particularly high reliability requirements on the therapist, if the measures are to distinguish phenomena of interest. To appreciate the difficulty further, consider the results obtained by Weeks (1982), who demonstrated that an intrasession change of at least 17% of scale would have to occur in R1 for therapists to detect it with 95% confidence. Since R1 in the population probably varies over about a third of the scale according to the data of Baker (1981), Weeks (1982) and Flint (1983), intrasession changes exceeding half of the total range of individual differences in R1 would have to occur for reliable detection by the therapist. It is equivalent to requiring a joint which is in the lower quartile of R1 in the population to change to the upper quartile. This seems a very unlikely proposition. The detection of intersession change, or of absolute location in range for R1 or R2, provides an even bleaker picture.
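How the 95% figures were derived is not stated in this section. A standard approach is to compute the standard error of measurement from the standard deviation and the retest coefficient, then scale it for the difference between two measurements. The sketch below assumes that approach and uses the standard deviations and coefficients quoted elsewhere in the text; the function name is hypothetical.

```python
from math import sqrt


def minimal_detectable_change(sd, retest_r, z=1.96):
    """Smallest change distinguishable from measurement error at the given z.

    SEM = sd * sqrt(1 - r); a difference of two measurements has error
    sqrt(2) * SEM. This is a standard formula, assumed rather than
    confirmed to be the one used in the studies reviewed.
    """
    sem = sd * sqrt(1 - retest_r)
    return z * sqrt(2) * sem


# Weeks (1982): SD of R1 about 8.3% of scale, intrasession r = 0.46.
print(round(minimal_detectable_change(8.3, 0.46), 1))
# Collis-Brown (1982): SD of P1 about 23.2% of ROM, retest r = 0.73.
print(round(minimal_detectable_change(23.2, 0.73), 1))
```

Under these assumptions the results are about 16.9 and 33.4, close to the 17% and 34% listed for these studies in Table 1. A change of 17% set against a population spread of roughly a third of the scale is just over half of that spread, as the paragraph above argues.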
Another aspect of the judgement task presented to therapists is identification of a specific point within ROM. In assessment of P1, this simply involves judgement of the current point in ROM at the time of patient report of pain. In assessments of compliance features this requires identification of the feature and subsequent estimation of the point in ROM at which this feature occurs. Identification of a feature seems to require that the feature exists in mechanical terms in order to provide a stimulus. It also seems to require that the feature be definable uniquely in terms of the therapist's perceptions of 'joint feel'. The experiments of Banting (1982) and Mitchell (1983), in which therapists were required to perform mobilizations to a particular point in ROM, indicated wide variations in therapists' 'connotations' of R1 and R2, since vastly different forces were utilized to reach the same point in range on the same subject. In the textbook (Maitland 1977) which established the nomenclature and theory in this field, we have been unable to find a precise operational definition of 'resistance'. Therapists with whom we have discussed this issue have not been able to reach consensus on a definition. A discussion of the distinctions between these definitions and their implications for the construction of the movement diagram is, however, beyond the scope of this review.
In the studies reported here specific features of pain or compliance were recorded on the two dimensional movement diagram. This two dimensional VAS helps clarify the therapist's assessment task and is recommended for summarizing and communicating clinical descriptions (Maitland 1977). It is of interest to examine the demands it makes upon the therapist. For example, the horizontal axis, which scales ROM, is defined by Maitland to represent 'any range of movement from the starting position at A to the limit of normal range at B. It makes no difference whether the movement depicted is small or large . . . Point B is always constant and always at the extreme of normal average range of passive movement' (Maitland 1977, p.317). This definition shows clearly that the therapist is not merely required to respond on a psychophysical scale according to current sensory input, a difficult enough task under the circumstances, but also has to make that scale relative to 'normal average range of movement'.
Several problems may be seen to arise from defining the scale relative to normal average range. First, the therapist is required to alter the scale in relation to past experience. This is likely to introduce a variety of biases (Kahneman et al 1982, Slovic et al 1977). Second, the therapist is apparently required to store many models of normality, since a different model will be required for different joints, different movements and perhaps other subsets as well, such as those generated by gender or age. This requirement places an even larger burden on memory. Third, the parameter for mental modelling is 'average normal range'. This seems rather vague, particularly since it requires statistical interpretation from the observer. Human intuitive perception of the statistical parameters of data samples suffers from several biases (Slovic et al 1977, Kahneman et al 1982). All of these factors are likely to increase the error of scaling. Nowhere in the clinical literature have we been able to discover evidence that therapists can in fact cope with such complexity of judgement. Our data, which consistently returned very poor intertherapist correlations, suggest that the task is too difficult.
A final problem, at least for the central PAIVM data reported above, may be seen to arise from the sensory information afforded by the technique. An essential value of passive movements to clinical theory seems to lie in the highly localized nature of their probing. As such the ROM of interest would appear to be only that which is relative to adjacent structures, rather than the overall movement through space described by the segment tested. However, consider the following statement from Maitland (1977, p.34): 'If the pressure is applied as a single slow pressure, the vertebral movement will not be appreciated at all; if it is applied too quickly it can only be interpreted as shaking. However, if the pressure is then relaxed and reapplied and repeated two or three times a second, the amount of movement which can take place will be readily appreciated'. As this statement indicates, the perception of relative movement relies not on direct sensation of displacement but on perception of phenomena which are not uniquely determined by relative displacement. As such, the movement diagram seems to place a burden of complex and undefined biomechanical
interpretation on the therapist, which will be conducive to the introduction of error. In fact when direct manual sensation of displacement relative to adjacent segments is not concomitantly undertaken, the situation we have usually observed to be the case during PAIVM tests, the movement diagram borders on being a biomechanical non sequitur to PAIVM. An alternative VAS more directly defined in terms of the sensory experience of the performing therapist may be preferable.
Clinical implications
As neither of us is trained with a clinical background in manual therapy we wish to confine our comments to a series of questions which the psychometric and biomechanical evidence presented above seems to raise.
Tests of pain have generally been considered most important in the assessment procedure (Maitland 1977). The reviewed results suggest good to excellent reliability for this aspect of assessment.
However, the poor reliability shown by tests of vertebral compliance during passive movement raises several questions about their role in clinical practice. Presumably one of the major virtues of passive movement tests of compliance is that they help localize the pathology. However, are there no satisfactory substitutes for achieving this goal? It is not yet clear that these tests would be required, even if reliable, given the plethora of other case information, together with epidemiological knowledge. Jull and Bogduk's (1985) results, interpreted in the context of those reported here, suggest that pain reproduction will very reliably select the level to be treated. How often do joint conditions present in the absence of pain? Furthermore, is precision in selection of level to be treated necessary? If there is no adverse effect associated with intervention at inappropriate levels, the additional resource cost involved would appear to be marginal, thus permitting a 'fail-safe' strategy to locality of intervention.
Another role for passive movement tests of compliance seems to be to aid in the selection of a direction and grade of movement. Again the question arises, could this decision be made on the basis of the other information? Furthermore, the research literature has yet to demonstrate that the grade of movement (in the respects defined by compliance features) selected is critical to clinical outcome. In any case, both inter- and intratherapist reliability in application of movement grades was demonstrably poor. Could the mobilization procedure be made to be more reliant on patient comfort and particularly patient feedback rather than on manual reassessment following treatment? If so, there is ample literature in the experimental psychology of motor skills which suggests that performance with feedback tends to be superior (Sage 1977). Perhaps feedback-based treatment, utilizing pain report as feedback, is the de facto modus operandi and the intertherapist unreliability in the absence of pain merely confirms this.
A third role which might be attributed to passive movement tests of compliance is to evaluate progress. If reliability is the criterion for selecting tests of progress, then the evidence presented indicates clearly superior alternatives. The objection may be raised that localized compliance changes must be uniquely traced. However, the case that compliance changes per se are pathological or uniquely related to pathology has yet to be definitively outlined in the research literature.
A final role which might be attributed to passive movement tests of compliance in clinical decision strategy is that of confirmatory tests. A confirmatory test is undertaken to reassure that a decision taken on another test is adequate. This is a common, but often misused clinical strategy. If test A correctly predicts a criterion variable (eg pathology of a given type) on 80% of occasions and if test B does likewise, then the final probability of a 'confirmed' decision which is also a correct decision is actually 64%! This arises because 'confirmation' implies that both tests yield the same prediction, thereby invoking the multiplicative law of contingent probability. On 4% of occasions the tests will confirm each other, but be simultaneously wrong (0.20 x 0.20 = 0.04). On 16% of occasions test A will be correct, but test B will disagree (0.80 x 0.20 = 0.16) and on another 16% vice-versa, making a total of 32% of occasions containing difficult disagreements. These figures deteriorate if one of these two tests should have a lower percentage of valid predictions.
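The figures in this paragraph follow from multiplying the two hit rates, which presupposes that the errors of the two tests are statistically independent and that each test makes a two-way prediction. A minimal sketch under those assumptions, with a hypothetical function name:

```python
def confirmation_outcomes(p_a, p_b):
    """Outcome probabilities for two independent tests with hit rates p_a, p_b.

    Returns (both correct, both wrong, tests disagree).
    """
    both_correct = p_a * p_b
    both_wrong = (1 - p_a) * (1 - p_b)
    disagree = p_a * (1 - p_b) + (1 - p_a) * p_b
    return both_correct, both_wrong, disagree


correct, wrong, disagree = confirmation_outcomes(0.80, 0.80)
print(round(correct, 2), round(wrong, 2), round(disagree, 2))  # 0.64 0.04 0.32

# Lowering one hit rate makes matters worse, as the text notes.
print([round(x, 2) for x in confirmation_outcomes(0.80, 0.60)])
```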
In conclusion therefore, the obtained results suggest that the assessment role of passive movement tests of compliance be seriously reconsidered, particularly PAIVM in its present form. If a case can be made that unique, essential information is provided by the passive assessment of compliance, and we reiterate that such a case has not yet been made in accordance with the rigors of empirical science, then it would seem that new methods of testing must be developed which achieve that purpose.
Present limitations and future directions
The conclusions drawn in this review must be understood in the light of the limitations imposed by the methodology of the studies providing the evidence for these conclusions. In particular, since we are reporting on an incomplete series of small studies, which of necessity must be limited in their sampling, several issues require discussion.
The number of therapists investigated in anyone study was typicallysmall. However most issues were addressed by more than one study andconsistent results were obtained. Someof the studies reviewed here involvedstudent manual therapists who hadvarying degrees of clinical experiencein physiotherapy practice but who hadnot yet completed their specialist programme in manual therapy. They had
192 The Australian Journal of Physiotherapy. Vol. 31, No. 5, 1985
however completed and passed the unit relevant to the particular procedures assessed. Several points can be put forward to argue the case that poor results were not the effect of therapist inadequacy or inexperience. Firstly, it might be argued that student therapists had recently completed a period of very intensive clinical training and were in fact likely to perform better than practising therapists who used some of these techniques less frequently. Secondly, when studies utilizing student therapists were replicated with experienced therapists, no significant differences in results were obtained. Thirdly, motor learning research suggests that when learning occurs in the absence of exteroceptive feedback, variability about some mean performance is reduced but the average performance remains unchanged (Gibson 1969). It is conceivable that upon completion of a period of formal training the therapists no longer receive information about the correctness of skills employed in practice from a common source and are therefore continuing to learn in the absence of shared feedback. We might therefore expect improvements in test-retest reliability in more experienced therapists. However, because their experience may have been individualized, it is possible that there will be no change, or even a deterioration, in intertherapist reliability. Finally, we could argue that the therapists selected represent a cross section of practising therapists and therefore represent the general level of therapeutic skills. We are unaware of factors which could have biased the samples toward the 'poor' therapists; in fact, in some cases, efforts were made to involve the more respected and established members of the therapeutic community.
In addition to 'type of therapist' other variables were sampled. These include anatomical location and assessment technique. Although cervical and thoracic segments were sampled in some studies the lumbar segments were observed much more frequently. The data from non-lumbar segments collected so far does not suggest that significantly better results for PAIVM tests of compliance will be obtained in these segments. Finally, it should be clear that since all studies reported are about spinal joints, no statement can be made about reliability in the assessment of peripheral joints. Clearly this aspect requires further investigation, particularly because substantial differences exist between spinal and peripheral joint assessment. For example, in peripheral joints goniometry is more readily applicable with current techniques. Furthermore, a contralateral joint is available for simple comparison in peripheral joints. Contralateral comparisons in spinal joints, when they are appropriate, seem rather more complex because both joints belong to the affected level.
In the assessment of spinal-joint compliance a number of interesting reliability comparisons remain to be conducted. The PAIVM data collected to date is limited to central PAIVM. The reliability of unilateral PAIVM seems deserving of investigation since, among other differences to central PAIVM, a contralateral comparison of sorts is available. In addition most of the evidence collected so far relates to PAIVM technique. The reliability of PPIVM tests, particularly in the cervical spine, also seems to deserve further investigation. In PPIVM the stimulus movement at the spine may be more controllable than in PAIVM because of the mechanical advantage argument invoked above in the discussion of the superior reliability of SLR and FF tests of pain. This could be particularly so for cervical movement where the therapist has a more manageable structure than the trunk. Furthermore, unlike PAIVM tests, during PPIVM tests the therapist is required to directly palpate the relative movement of adjoining segments in addition to sensing the force required to produce that movement.
A number of lines of research are also suggested by the results obtained. For example, we have already mentioned that the poor reliability of passive assessment of compliance features indicates that their contribution to the overall clinical decision process should be carefully assessed. Our group has taken some initial steps in that direction (Cunningham 1982, Walker 1984). If compliance assessment proves in the future to be essential to clinical decision-making, more reliable tests will need to be developed. It may be necessary to develop instrumented approaches to this problem. Thompson's study (1983) is a first step in that direction in our laboratories. In any case, such instrumentation will be required if adequate surveys of spinal joint compliance are to be completed in order to provide the normative data currently missing from the scientific literature of manual therapy. The poor reliability found for production of selected grades of movement further strengthens the requirement to investigate the dependence of clinical outcome on particular grades of mobilization, a requirement initially posed by the apparent absence of formal study on this central issue in some approaches to mobilization.
Manual therapy, at this point in its development, is in the position of having developed to the stage of a complex clinical theory well in advance of a sound base of verifiable, empirical data. It should be clear from the foregoing that even within the narrow aims selected by the respective investigators a great deal remains to be done. It is our hope that the above will prove to be a seminal contribution in the field of clinical arthrometrics.
Acknowledgement
We are indebted to our clinician colleagues who have been a constant source of inspiration and feedback; to the many therapists who have so generously given their time, skills and other resources; and above all to our students, without whose work this would not have been possible. We hope they will continue to confront the unknown with curiosity, rationality and constructive hard work.
References
Allbrook D (1957), Movements of the lumbar spinal column, The Journal of Bone and Joint Surgery, 39B, 339-345.
Allen D (1983), The reliability of determining the most abnormal lumbar joint using passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Anderson JAD and Sweetman BJ (1975), A combined flexi-rule/hydrogoniometer for measurement of lumbar spine and its sagittal movement, Rheumatology and Rehabilitation, 14, 173-179.
Bach TM (1985), An indirect method for measuring forces applied during therapeutic intervention and assessment techniques. Unpublished manuscript.
Baker M (1981), Interobserver reliability of range impairing stiffness ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Banting J (1982), Intertherapist reliability in the performance of a grade II mobilization movement. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Banting JB, Mitchell WN, Bach TM and Matyas TA (1985), Reliability in the execution of selected grades of mobilization in manual therapy. Unpublished manuscript.
Boone DC, Azen SP, Lin CM, Spence C, Baron C and Lee L (1978), Reliability of goniometric measurement, Physical Therapy, 58, 1355-1360.
Breig A and Troup JDG (1979), Biomechanical considerations in the straight-leg-raising test, Spine, 4, 242-250.
Bogduk N (1985), A scientific approach to cervical diagnosis, Proceedings of the Australian Physiotherapy Association Conference, Brisbane.
Bruce P (1981), The test-retest reliability of physiological movement as a method of assessing range of movement to the level of pain tolerance. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Clarkson M (1982), Intertherapist reliability in assessing stiffness ratings in the lumbar spine obtained from passive physiological intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Cohen J (1960), A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20, 37-46.
Collis-Brown GL (1982), Test-retest reliability of pain onset ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Cunningham G (1982), Clinical decision making in manipulative therapy: the effect of antecedent information on palpation findings. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Cyriax J (1982), Textbook of Orthopaedic Medicine, Vol. 1, (9th ed.), Bailliere Tindall, London.
DePalma AF and Rothman RH (1970), The Intervertebral Disc, WB Saunders, Philadelphia.
Edwards AL (1964), Statistics for the Behavioural Sciences, Holt, Rinehart and Winston, New York.
Ekstrand J, Wiktorsson M, Oberg B and Gillquist J (1982), Lower extremity goniometric measurements: a study to determine their reliability, Archives of Physical Medicine and Rehabilitation, 63, 171-175.
Farfan HF (1973), Mechanical Disorders of the Low Back, Lea and Febiger, Philadelphia.
Fleiss JL, Cohen J and Everitt BS (1969), Large sample standard errors of kappa and weighted kappa, Psychological Bulletin, 72, 323-327.
Flint R (1983), Intertherapist reliability for the assessment of joint measurement behavior by means of passive accessory intervertebral movements (PAIVMs). Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Gibson E (1969), Principles of Perceptual Learning and Development, Appleton-Century-Crofts, New York.
Goddard MD and Reid JD (1965), Movements induced by straight leg raising in the lumbosacral roots, nerves and plexus, and the intrapelvic section of the sciatic nerve, Journal of Neurology, Neurosurgery and Psychiatry, 28, 12-18.
Gonnella C, Paris SV and Kutner M (1982), Reliability in evaluating passive intervertebral motion, Physical Therapy, 62, 436-444.
Grant R (1980), Lumbar sagittal mobility in hypermobile individuals. Proceedings of the Manipulative Therapy Association of Australia, Adelaide.
Grisold PM (1983), Estimating range of movement from passive accessory intervertebral movements; the nature of the scale and the reliability of performance. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Guilford JP (1954), Psychometric Methods, McGraw-Hill, New York, chs 13, 14.
Hanley EN, Matter RE and Frymoyer JW (1976), Accurate roentgenographic determination of lumbar flexion-extension, Clinical Orthopaedics and Related Research, 115, 145-148.
Hart FD, Strickland D and Cliffe P (1974), Measurement of spinal mobility, Annals of Rheumatic Diseases, 33, 136-139.
Hartman DP (1977), Considerations in the choice of inter-observer reliability estimates, Journal of Applied Behavior Analysis, 10, 103-116.
Hoehler FK and Tobis JS (1982), Low back pain and its treatment by spinal manipulation: measures of flexibility and asymmetry, Rheumatology and Rehabilitation, 21, 21-26.
Hollenbeck AR (1978), Problems of reliability in observational research, in GP Sackett (Ed.), Observing Behaviour, Vol 2: Data Collection and Analysis Methods, University Park Press, Baltimore.
Hubert L (1977), Kappa revisited, Psychological Bulletin, 84, 289-297.
Johnston WL (1982), Passive gross motion testing: Part I. Its role in physical examination, Journal of the American Osteopathic Association, 81, 298-303.
Johnston WL, Elkiss ML, Marino RV and Blum GA (1982a), Passive gross motion testing: Part II. A study of interexaminer agreement, Journal of the American Osteopathic Association, 81, 304-308.
Johnston WL, Beal MC, Blum GA, Hendra JL, Neff DR and Rosen ME (1982b), Passive gross motion testing: Part III. Examiner agreement on selected subjects, Journal of the American Osteopathic Association, 81, 309-313.
Jull G (1978), Clinical observations of upper cervical mobility, Proceedings of the Inaugural Congress of the Manipulative Therapy Association of Australia, Sydney.
Jull G (1982), Passive intervertebral movements of the lumbar spine, in Toward a better understanding of spinal pain. Proceedings of the Manipulative Therapy Association of Australia Annual Conference, Brisbane.
Jull GA and Lane MB (1983), Aspects of lumbar spine mobility in a normal population, in KD Bower (Ed.), International Conference on Manipulative Therapy Proceedings, Perth.
Jull GA and Bogduk N (1985), Manual examination: An objective test of cervical joint dysfunction, Proceedings of the Australian Physiotherapy Association Conference, Brisbane.
Kahneman D, Slovic P and Tversky A (Eds) (1982), Judgement Under Uncertainty: Heuristics and Biases, Cambridge University Press, Cambridge.
Kaltenborn F and Lindahl O (1969), Reproducibility of the results of manual mobility testing of specific intervertebral segments, Lakartidningen (Swedish Medical Journal), 66, 962-965.
Kapandji IA (1974), The Physiology of the Joints, Vol 3, (2nd ed.), Livingstone, Edinburgh.
Kwong HF (1981), Test-retest reliability of pain onset assessed by active 'physiological' movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Lankhorst GJ, Van de Stadt RJ, Vogelaar TW, Van der Korst JK and Prevo AJH (1982), Objectivity and repeatability of measurements in low back pain, Scandinavian Journal of Rehabilitative Medicine, 14, 21-26.
Leighton JR (1955), Instrument and technic for measurement of range of joint motion, Archives of Physical Medicine and Rehabilitation, 36, 571-578.
Loebl WY (1967), Measurement of spinal posture and range of spinal movement, Annals of Physical Medicine, 9, 103-110.
Macfarlane A (1981), Test-retest reliability of straight leg raise as determined by pain onset. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Macrae IF and Wright V (1969), Measurement of back movement, Annals of Rheumatic Diseases, 28, 584-589.
Maitland GD (1977), Vertebral Manipulation, (4th ed.), Butterworth, London.
McNeill KE (1982), Intertherapist reliability of pain onset from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Meyers LS and Grossen NE (1974), Behavioral Research: Theory, Procedure, Design, WH Freeman and Co., San Francisco, 164-166.
Million R, Hall W, Haavik Nilsen K, Baker RD and Jayson MIV (1982), Assessment of the progress of the back pain patient, Spine, 7, 204-212.
Millman AJ (1981), Test-retest reliability of range impairing stiffness ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Mitchell WN (1983), Reliability in the performance of Grade II and Grade IV mobilizations. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Moll J and Wright V (1976), Measurement of spinal movement, in M Jayson (Ed.), The Lumbar Spine and Back Pain, Sector, London.
Moran HM, Hall MA, Barr A and Ansel BM (1979), Spinal mobility in the adolescent, Rheumatology and Rehabilitation, 18, 181-185.
Munro R (1983), The contribution of pain versus memory for position and exteroceptive feedback in the forward flexion test. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Murphy RW (1977), Nerve roots and spinal nerves in degenerative disc disease, Clinical Orthopaedics and Related Research, 129, 46-57.
Myers H (1961), Range of motion: Part I introductory review of literature, Physical Therapy Reviews, 29, 195-205.
Nunally JC (1978), Psychometric Theory, (2nd ed.), McGraw-Hill, New York.
O'Keefe PJ (1981), The spinal compliance test. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Patterson S (1982), The test-retest reliability of lumbar flexion when limited by pain. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Puentedura L (1983), The effects of trunk position on straight leg raise in normal subjects. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Punjabe MM, Krag MH, White AA and Southwick WO (1977), Effect of preload on load displacement curves of the lumbar spine, Orthopaedic Clinics of North America, 8, 181-192.
Reynolds PM (1975), Measurement of spinal mobility: a comparison of three methods, Rheumatology and Rehabilitation, 14, 180-185.
Sage GH (1977), Introduction to Motor Behavior: A Neuro-psychological Approach, (2nd ed.), Addison Wesley, Reading, Massachusetts, ch 20.
Slovic P, Fischhoff B and Lichtenstein S (1977), Behavioral decision theory, Annual Review of Psychology, 28, 1-39.
Stoddard A (1980), Manual of Osteopathic Technique, (3rd ed.), Hutchinson, London.
Thompson R (1983), Measurement of relative intervertebral displacement in the lumbar spine during application of a PAIVM. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Troup JGD, Hodd CA and Chapman AE (1967), Measurements of the sagittal mobility of the lumbar spine and hips, Annals of Physical Medicine, 9, 308.
Twomey LT and Taylor JF (1979), A description of two new instruments for measuring the ranges of sagittal and horizontal plane motions in the lumbar region, Australian Journal of Physiotherapy, 25, 201-203.
Van Adrichem JA and Van Der Korst JK (1973), Assessment of the flexibility of the lumbar spine, Scandinavian Journal of Rheumatology, 2, 87-91.
Walker D (1984), A survey of treatment selection and subjective certainty at different stages of clinical assessment. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Weeks PM (1982), Test-retest reliability of stiffness onset using passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Wong M (1981), Interobserver reliability of stiffness onset ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.
Appendix
Reliability theory is a highly developed field with ample presentation of its concepts (Guilford 1954, Edwards 1964, Nunally 1978). This appendix will only review selected issues of interest to a number of the studies reported in this review. A knowledge of basic statistical theory (mean, variance, correlation, statistical inference) is assumed in the following discussion.
The reliability of a measurement process refers to the dependability, or reproducibility, of observed scores when these are obtained from measurements of the same events. Reliability classically relates the extent to which observed scores represent the true values of the events measured. Equation (A.1), where $X_o$ = observed score, $X_t$ = true score and $E$ = error component, shows that the observed value can be represented as being partly composed of true quantity and partly error.
$$X_o = X_t + E \tag{A.1}$$
If $X_t$ is known the discrepancy of $X_o$ readily quantifies the error. The larger $E$ is the more unreliable is the observation.
Two patterns of error can occur: systematic error, such that $E$ is constant; and random error, such that $E$ is unpredictably variable from measurement to measurement. If the error is constant, then observed scores will be the same across several measurements of a given true value. The instrument is therefore not considered unreliable under these circumstances. Constant error does affect the truth of the absolute value, but the difference between two observed scores will be equal to the true score difference. However, if the error is random, measurements will vary unpredictably even when the same true value is under observation. The quantitative theory of reliability is concerned therefore with random error.
Since $E$ may vary from one occasion of measurement to the next, a consequent problem is how to summarize the 'typical' size of $E$. Furthermore, the interest usually lies in describing how reliable an observation process is for a variety of objects which lie on a common dimension, rather than in describing the reliability for measuring only one object. This also requires the definition of a method for indexing the 'typical' value of error. Thus in estimating error, a sample of values is usually generated. Hence the issue of 'typical' error is a problem in sampling theory and the associated descriptive statistics.
If a sample consisting of one measurement of several objects is taken, then each score could be expressed as a deviation from the sample mean rather than in raw score units. Equation (A.2) then follows from (A.1):
$$x_o = x_t + e \tag{A.2}$$
where $x_o = X_o - \bar{X}_o$, the deviation of the observed raw score from the mean of the observed scores; $x_t = X_t - \bar{X}_t$, the deviation of the true score from the mean of the true scores; and $e = E - \bar{E}$, the deviation
of the error component from the mean of the error components.
Since the problem is to obtain a measure of 'typical' amount of random error, the deviation scores could be averaged over the sample. However, if error is random, there will be just as much positive deviation as negative deviation, yielding a misleading average of zero. To overcome this, statisticians deal with squared deviation, which has the effect of removing the algebraic sign. The mean squared deviation score will not average to zero. In deviation score units the average may be obtained as follows:
$$x_o^2 = (x_t + e)^2 \tag{A.3}$$
That is,
$$x_o^2 = x_t^2 + e^2 + 2x_t e \tag{A.4}$$
Summing over the sample and dividing by the number of cases yields the averages:
$$\frac{\sum x_o^2}{n} = \frac{\sum (x_t^2 + e^2 + 2x_t e)}{n} \tag{A.5}$$
Hence,
$$\frac{\sum x_o^2}{n} = \frac{\sum x_t^2}{n} + \frac{\sum e^2}{n} + \frac{\sum 2x_t e}{n} \tag{A.6}$$
Since the error is randomly positive and negative in equal quantity, over the total sample $\sum 2x_t e / n$ will tend to be zero, as in the earlier argument.
Therefore,
$$\frac{\sum x_o^2}{n} = \frac{\sum x_t^2}{n} + \frac{\sum e^2}{n} \tag{A.7}$$
The average squared deviation from the mean is known as the variance, or $s^2$. Therefore, the variance of observed scores is composed of true variance plus error variance:
$$s_o^2 = s_t^2 + s_e^2 \tag{A.8}$$
Since $s_e^2$ is the amount of squared error (in deviation units) per case it seems to be an adequate measure of 'typical' error.
However, there are several drawbacks to using $s_e^2$ as the sole index of reliability. One is that the units of error are squared, which makes interpretation awkward. This is easily resolved by defining the square root of $s_e^2$ to be the 'standard error of measurement':
$$s_e = \sqrt{s_e^2} \tag{A.9}$$
This index is more readily interpreted and is commonly cited.
Frequently the interest lies in measuring change from one occasion to another. In these situations each of the two measurements will introduce some error. If $d_o$ is the observed difference score in deviation units, $x_o'$ is the observed deviation score on the second occasion and $e'$ is the random error in deviation units, then it follows from (A.2) that:
$$d_o = x_o - x_o' = (x_t - x_t') + (e - e')$$
If the true difference in deviation units is $d_t = (x_t - x_t')$ then:
$$d_o^2 = [d_t + (e - e')]^2 = d_t^2 + (e - e')^2 + 2d_t(e - e')$$
$$\therefore \sum d_o^2 = \sum d_t^2 + \sum (e - e')^2 + \sum 2d_t(e - e')$$
Note that $\sum (e - e') = \sum e - \sum e'$. Since both $e$ and $e'$ are random within (and between) the measurement samples, then $\sum e = \sum e' = 0$ and $\sum (e - e') = 0$ (within the limits of sampling error) following the earlier argument. Because these errors are also randomly paired with the true differences $d_t$, the cross-product term behaves like the cross-product term in the earlier argument. Thus $\sum 2d_t(e - e') = 0$ and:
$$\sum d_o^2 = \sum d_t^2 + \sum (e - e')^2 = \sum d_t^2 + \left(\sum e^2 + \sum e'^2 - 2\sum ee'\right)$$
Again, since $e$ and $e'$ are random, the positive and negative components will be equal (within the limits of sampling error). Thus $\sum 2ee' = 0$ and:
$$\sum d_o^2 = \sum d_t^2 + \left(\sum e^2 + \sum e'^2\right)$$
If $n$ is the number of pairs of observed scores, then:
$$\frac{\sum d_o^2}{n} = \frac{\sum d_t^2}{n} + \left(\frac{\sum e^2}{n} + \frac{\sum e'^2}{n}\right)$$
$$\therefore s_{d_o}^2 = s_{d_t}^2 + \left(s_e^2 + s_{e'}^2\right)$$
Consequently the error of measuring change will be larger than the error for measuring on either occasion. The standard error of measuring changes ($s_{e\,\mathrm{diff}}$) will be:
$$s_{e\,\mathrm{diff}} = \sqrt{s_e^2 + s_{e'}^2} \tag{A.10}$$
The standard error of measurement however is a measure of 'typical' error. The error will sometimes be less, sometimes more. Most often, it is assumed that error is variable in both direction and magnitude, with small errors more probable than large errors. Although situations may arise where other assumptions are better, it is usual to imagine that the errors around a true value are normally distributed. Thus the mean of a sample of observed values of the same event will be the best estimate of the event's true score. If the errors around this true value follow the assumed normal distribution it is possible to calculate over what range some specified proportion of observed values will fall. This statistic is known as the confidence interval (CI):
$$CI = X_t \pm z_\alpha s_e \tag{A.11}$$
where $(1 - \alpha)$ is the confidence level and $z_\alpha$ is the appropriate value from the normal distribution. An analogous equation can be written for difference scores by substituting $s_{e\,\mathrm{diff}}$ for $s_e$. The virtue of transforming a standard error into confidence intervals is that it acknowledges the error to be variable and permits calculation of the proportions of observations which will occur within some given error range, or vice-versa. It thus more completely models the error of measurement.
Another drawback to both $s_e^2$ and $s_e$ is that they are metric bound indexes. That is, standard errors of various measures are not readily comparable: different approaches to measurement must often be compared; measurement units are sometimes arbitrary; the comparative reliability of measurement in different fields is an issue at times. In these cases a unit-free index of reliability is preferable. Percentages or proportions are often used to resolve such a problem. From (A.8) it follows that:
$$\frac{s_t^2}{s_o^2} + \frac{s_e^2}{s_o^2} = 1 \tag{A.12}$$
Thus $s_t^2/s_o^2$ is the proportion of observed score variance due to true score variance and $s_e^2/s_o^2$ is the proportion due to error score variance. The former may be defined as a coefficient of reliability. For a perfectly errorless measurement method $s_e^2 = 0$. Thus $s_o^2 = s_t^2$ and the reliability coefficient $s_t^2/s_o^2$ will be 1. As $s_e^2/s_o^2$ increases so the reliability diminishes. For a measurement method which is maximally errorful all the observed variance is error variance, ie $s_e^2 = s_o^2$. In this case $s_t^2 = 0$ and the coefficient of reliability will be zero.
How then to estimate $s_e$ and its associated statistics in practice? Clearly one way might be to measure a set of events whose values are known, then calculate $s_o^2$ and $s_t^2$ from observed and known values. From this, $s_e^2$ and its derivatives $s_e$, $s_{e\,\mathrm{diff}}$, the reliability coefficient and various confidence intervals could be obtained.
Unfortunately, in practice, particularly in new fields of measurement, this is often impossible, since the true values are not known. However, although the true values are not known, it can be safely assumed that if a variety of events are measured, some variation in true scores should occur. If these events are measured again the initial values should be exactly reproduced provided there is no error. To the extent that there is random error the relative position among the initial observations will not be reproduced.
It should be noted that failure to reproduce scores can result from several processes. Instruments or observers may be unstable over time (test-retest unreliability). Measures taken by two observers may differ (interobserver unreliability). Measures taken by two versions of the same instrument or test may differ (parallel form unreliability). In measurements composed of several observations a part-score from a subset of the observations may be compared to a part-score based on another subset (internal consistency). These are all different practical methods for obtaining two estimates of the same underlying true value. Although the errors introduced in attempting reobservation by different methods are likely to be different, all these practical approaches to establishing reliability have in common the need to quantify the degree to which one set of observations predicts another set of observations of the same events.
The correlation coefficient $r$ (Edwards 1964) is a measure of the degree to which one data set predicts another. If the sample of events is remeasured (eg on another occasion, or by another observer), then equation (A.13) relates the observed scores $Y_o$, the true scores $Y_t$ and the error $E$:
$$Y_o = Y_t + E \tag{A.13}$$
All of equations (A.2) - (A.12) can be rewritten for these second measurements. Since reliability can be defined as the extent to which measurements predict remeasurements of the same events, the correlation between X and Y will be an index of reliability. The correlation coefficient is defined as the average cross-product of the standardized scores on X and Y:
$$r = \frac{\sum z_x z_y}{n} \tag{A.14}$$
We will assume that the reader is already familiar with the theory of correlation, which indicates how this index relates to scattergrams; and how it varies between 0 (when X and Y are randomly related) and 1.0 (when X, Y coordinates plot perfectly on a straight line).
An algebraically equivalent equation for $r$ can be written in deviation scores since $z_x = (X - \bar{X})/s_x$ and $z_y = (Y - \bar{Y})/s_y$:
$$r = \frac{\sum x_o y_o}{\sqrt{\sum x_o^2 \sum y_o^2}} \tag{A.15}$$
If $e_x$ and $e_y$ are the error deviation scores for X and Y respectively, then:
$$\sum x_o y_o = \sum (x_t + e_x)(y_t + e_y) = \sum x_t y_t + \sum x_t e_y + \sum y_t e_x + \sum e_x e_y$$
Since $e_x$ and $e_y$ are random (with positive and negative values equivalent and randomly paired to particular $x_t$'s or $y_t$'s) it follows that $\sum x_t e_y = 0$, $\sum y_t e_x = 0$, and also that $\sum e_x e_y = 0$. Thus $\sum x_o y_o = \sum x_t y_t$. Since the same events are being remeasured $x_t = y_t$ and therefore
$$\sum x_o y_o = \sum x_t^2 = \sum y_t^2 \tag{A.16}$$
Furthermore, since the same events are being measured twice, within the limits of sampling error the two samples X and Y should have the same variance, $s_x^2 = s_y^2$.
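The argument leading to (A.16) can be illustrated numerically. The sketch below is a simulation under stated assumptions, not a reanalysis of any study reviewed here: true scores and errors are drawn from normal distributions with arbitrarily chosen standard deviations (10 and 5), the errors on the two occasions are independent, and all names (`simulate`, `deviations`, `corr_z`, `corr_dev`) are hypothetical. It computes $r$ from standardized scores as in (A.14) and from deviation scores as in (A.15), and compares the result with the proportion of true score variance $s_t^2/s_o^2$ described after (A.12).

```python
import math
import random


def simulate(n, sd_true, sd_err, seed=1):
    """Return two error-laden measurements (X, Y) of the same n true scores."""
    rng = random.Random(seed)
    true = [rng.gauss(50.0, sd_true) for _ in range(n)]
    x = [t + rng.gauss(0.0, sd_err) for t in true]  # first occasion
    y = [t + rng.gauss(0.0, sd_err) for t in true]  # second occasion
    return x, y


def deviations(values):
    """Express raw scores as deviations from their sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def corr_z(x, y):
    """r as the average cross-product of standardized scores, as in (A.14)."""
    n = len(x)
    dx, dy = deviations(x), deviations(y)
    sx = math.sqrt(sum(d * d for d in dx) / n)
    sy = math.sqrt(sum(d * d for d in dy) / n)
    return sum((a / sx) * (b / sy) for a, b in zip(dx, dy)) / n


def corr_dev(x, y):
    """r computed directly from deviation scores, as in (A.15)."""
    dx, dy = deviations(x), deviations(y)
    num = sum(a * b for a, b in zip(dx, dy))
    return num / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


if __name__ == "__main__":
    x, y = simulate(n=20000, sd_true=10.0, sd_err=5.0)
    expected = 10.0**2 / (10.0**2 + 5.0**2)  # s_t^2 / s_o^2 by construction
    print(f"r from (A.14): {corr_z(x, y):.3f}")
    print(f"r from (A.15): {corr_dev(x, y):.3f}")
    print(f"s_t^2 / s_o^2: {expected:.3f}")
```

Shrinking `sd_err` moves $r$ toward 1 and enlarging it moves $r$ toward 0, in line with the limiting cases described for the coefficient of reliability.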
Reliability in Clinical Arthrometrics
(A.20)r =
(A.22)
(A.21)
,(Sis)R = J/
The preceding discussion is concerned with the theoryof reliability as it applies to variables measured on intervalor ratio scales such as might occur in goniometry. Oftenclinical measurement is categorical in nature, such as whenrating abnormality, or when rating the stiffness of a jointalong a five point scale. A reliability theory needs to bedefined for these situations also.
A frequently employed measure for describing test-retestor interobserver reliability of categorical data is the per-
In equation (2I)R = correlation for uncurtailed distribution, S = standard deviation of uncurtailed distribution,, = correlation of the curtailed distribution, s = standarddeviation of the curtailed distribution.
Another conclusion derived from reliability theory whichis relevant to the main text is that the observed correlationbetween two variables will be less than the theoreticallypossible correlation between their true scores. This occursbecause both variables are measured with some randomerror. If the reliability coefficients for both variables areknown, the theoretically possible relationship between thetwo variables when measured without error can be calculated (2). If X and Yare the two variables, r XtYt = thecorrelation between the true X and Y scores, , x Y =observed correlation between X and Y, rxx = the reliabilitycoefficient for measuring X and r yy = the reliability coefficient for measuring Y, then:
Imagine an experiment where a blindfolded human subject is required to palpate 10 cubes, the sides of which vary in 5 mm steps from 10 mm sides to 55 mm. The cubes are then repalpated. The subject is required to judge their size on both occasions and a reliability coefficient is calculated. Conversely, imagine the same experiment with 10 cubes varying in 1 mm steps from 20 mm to 29 mm. Since the same palpatory technique is employed on similar events, the random error of measurement in metric terms, $s_e$, should remain comparable (within the limits of sampling variation). Thus $s_e$ is assumed constant across the two experiments. However, the true score variance will be larger in the first experiment with cubes ranging from 10 mm to 55 mm. It follows from equation (A.20) that if $s_t^2$ diminishes while $s_e^2$ remains constant, then the ratio $r$ will diminish also. It is therefore important that reliability studies use stimuli with a range of variability which is representative of the events to which the instrument or observational procedure will be ultimately applied. If a range restriction does occur a correction is available:
$$\sum x_o^2 = \sum y_o^2 \tag{A.17}$$

$$r = \frac{s_t^2}{s_o^2} \tag{A.18}$$

$$s_e = s_o \sqrt{1 - r} \tag{A.19}$$
Note that both $r$ and $s_o$ are readily calculated from experimental data.
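As an illustration of how these quantities might be obtained from data, the sketch below computes $r$ as the correlation between two rounds of measurement, takes $s_o$ as the standard deviation of the observed scores, and derives $s_e = s_o \sqrt{1 - r}$. The scores are invented, the function names are illustrative, and pooling both rounds to estimate $s_o$ is an assumption made here rather than a procedure stated in the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def population_sd(scores):
    """Standard deviation as the root of the average squared deviation."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((v - mean) ** 2 for v in scores) / n)


def standard_error_of_measurement(s_o, r):
    """s_e = s_o * sqrt(1 - r)."""
    return s_o * math.sqrt(1.0 - r)


# Hypothetical test and retest scores for six events.
first = [12.0, 15.0, 19.0, 22.0, 26.0, 30.0]
second = [13.0, 14.0, 20.0, 21.0, 27.0, 29.0]

r = pearson_r(first, second)
s_o = population_sd(first + second)  # assumption: pool both rounds
s_e = standard_error_of_measurement(s_o, r)

if __name__ == "__main__":
    print(f"r = {r:.3f}, s_o = {s_o:.2f}, s_e = {s_e:.2f}")
```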
The preceding theory outlines the rationale and interpretational basis of the major classical indexes for quantifying reliability: the standard error of measurement ($s_e$); the reliability coefficient ($r$); and the confidence interval (CI) around the true score (or around the change score). A number of further conclusions are derivable from the foregoing. An entire exposition of these is beyond the scope of this contribution. However, two aspects are important to arguments presented in the main text.
One aspect is that the reliability coefficient is sensitive to the amount of true score variance. If the true score variance is for some reason restricted, the reliability coefficient will be reduced provided the error of measurement remains constant. This conclusion follows from equations (A.8) and (A.18). Since $r = s_t^2/s_o^2$ and $s_o^2 = s_t^2 + s_e^2$, then:
variance being the average squared deviation, it follows that $\sum x_o^2 / n = \sum y_o^2 / n$. That is:
where $s_t^2$ = true score variance and $s_o^2$ = observed score variance. Equation (A.18) allows the very important conclusion that the correlation between two measures of the same sample of events is in fact the reliability coefficient defined from equation (A.12).
This conclusion not only enhances the interpretation of reliability and its evaluation in practice, but also permits evaluation of the other useful index of reliability, $s_e$, and its derivative, the confidence interval. From equation (A.12) it follows that $1 - r = s_e^2 / s_o^2$. Thus $s_e^2 = s_o^2 (1 - r)$ and therefore:
From equations (A.16) and (A.17), equation (A.18) may be rewritten:
Dividing both numerators and denominators by n defines $r$ in terms of variances:
$$r = \frac{\sum x_t^2 / n}{\sum x_o^2 / n} = \frac{\sum y_t^2 / n}{\sum y_o^2 / n}$$
Kappa expresses observed agreement $P_o$ relative to expected agreement. It also expresses that difference as a proportion of the distance between random and perfect agreement. Thus kappa is very similar to the reliability coefficient. If $P_o$ is 100%, then $\kappa = 1$; if $P_o = P_e$, then $\kappa = 0$. As $P_o$ exceeds $P_e$, so kappa grows. Although the analogy between $\kappa$ and $r$ is limited, a number of problems are practically resolved by this statistic. The probability distribution of kappa has been investigated (Fleiss and Cohen 1969, Hubert 1977) and it is a method with relevance to a wide variety of reliability problems when categorical data is encountered (Hartman 1977, Hollenbeck 1978). Other correlation-like statistics, such as $\phi$, are applicable to problems of association in categorical data, but a full discussion of their relative values is beyond the scope of this appendix.
rating. Let a, b, c, d be the proportions of ratings in the respective categories obtained on the first round of measurements and let a', b', c', d' be the corresponding proportions on the second round. If agreement is defined as not only the conjunction of identical ratings, but also of adjacent ratings, then elementary probability theory concludes that $P_e$ = aa' + ab' + ba' + bb' + bc' + cb' + cc' + cd' + dc' + dd'. In the above example a = a' = 0.1, b = b' = 0.4, c = c' = 0.4, d = d' = 0.1. Therefore $P_e$ = 0.82, which is substantially larger than 0.34, the result obtained with the stricter agreement rule.
To overcome such disadvantages Cohen (1960) defined the statistic kappa:

$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{A.23}$$
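Cohen's kappa, the observed agreement minus the expected agreement, divided by one minus the expected agreement, can be computed directly from two sets of categorical ratings. The sketch below is illustrative only: the function names and the sample ratings are invented, and agreement is taken to mean identical categories, with expected agreement derived from each round's marginal proportions.

```python
from collections import Counter


def kappa_from_proportions(p_o, p_e):
    """Kappa from observed (p_o) and expected (p_e) agreement proportions."""
    # Undefined when p_e == 1, i.e. when chance agreement is already perfect.
    return (p_o - p_e) / (1.0 - p_e)


def cohen_kappa(ratings_1, ratings_2):
    """Kappa for two equally long lists of category labels."""
    n = len(ratings_1)
    # Observed proportion of identical ratings.
    p_o = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Expected agreement under the random model, from the marginals.
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    categories = set(counts_1) | set(counts_2)
    p_e = sum((counts_1[k] / n) * (counts_2[k] / n) for k in categories)
    return kappa_from_proportions(p_o, p_e)


if __name__ == "__main__":
    # Hypothetical ratings of four joints on two occasions.
    round_1 = ["A", "A", "B", "B"]
    round_2 = ["A", "B", "B", "B"]
    print(cohen_kappa(round_1, round_2))
```

In this small example three of the four ratings agree, giving an observed agreement of 0.75. The marginals give an expected agreement of 0.5, so kappa is 0.5.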
centage of agreement between the two sets of observations. The rationale is similar to that of the reliability coefficient: the presence of error will reduce agreement. Although simple and widely used, percent agreement has some deficiencies.
Even if measurement is totally unreliable, that is if the observed categories arose at random, there will be some degree of agreement. Furthermore, that degree will be influenced by the distribution of measurements which arise from random processes. These distributions are different in different circumstances.
For example, the number of categories in the scale will influence randomly obtained agreement levels. On a two point scale, if both responses are equiprobable the expected agreement rate is 0.50. On a four point scale, if all responses are equiprobable, the expected agreement rate is 0.25.
In addition, the assumption of equiprobability may be inappropriate. If the incidence of the two middle categories in the four category example was 0.4 for each and if the incidence for the two extreme categories was 0.1 for each, then basic probability theory indicates the percentage of expected agreement ($P_e$) would be $P_e = (0.1 \times 0.1) + (0.1 \times 0.1) + (0.4 \times 0.4) + (0.4 \times 0.4)$, that is 0.34. The two distributions of marginal probability, furthermore, need not be identical as in this example. Nevertheless, basic probability theory can readily yield expected proportions of agreement under the random model.
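The expected agreement under the random model can be computed for any pair of marginal distributions. The sketch below is a hypothetical helper, not part of the original appendix. Its `tolerance` argument is an assumption added here so that the same function also covers the looser rule, discussed in this appendix, under which adjacent categories count as agreement.

```python
def expected_agreement(first, second, tolerance=0):
    """Chance agreement for two lists of ordered category proportions.

    first, second: marginal proportions for the two rounds of ratings,
        listed in category order (each list should sum to 1).
    tolerance: 0 counts only identical categories as agreement,
        1 also counts adjacent categories, and so on.
    """
    return sum(
        p * q
        for i, p in enumerate(first)
        for j, q in enumerate(second)
        if abs(i - j) <= tolerance
    )


if __name__ == "__main__":
    print(expected_agreement([0.5, 0.5], [0.5, 0.5]))    # two point scale
    print(expected_agreement([0.25] * 4, [0.25] * 4))    # four point scale
    unequal = [0.1, 0.4, 0.4, 0.1]
    print(round(expected_agreement(unequal, unequal), 2))
    print(round(expected_agreement(unequal, unequal, tolerance=1), 2))
```

The four values correspond to the cases worked in the text: 0.50 and 0.25 for equiprobable two and four point scales, 0.34 for the unequal marginals with strict agreement, and 0.82 when adjacent ratings also count as agreement.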
Another factor which can influence the proportion of agreement under a random model is the definition of agreement. Let A, B, C, D be the four categories of the above