3up

Statistical Analysis ofMethod Comparison studies

Bendix Carstensen Steno Diabetes Center, Gentofte, Denmark& Dept. Biostatistics, Medical Faculty,

University of Copenhagen

http://BendixCarstensen.com

Claus Thorn Ekstrm Statistics, Faculty of Life Sciences,University of Copenhagenwww.statistics.life.ku.dk/~ekstrom/

Tutorial, SISMEC, Ancona, Italy28 September 2011

http://BendixCarstensen.com/MethComp/Ancona.2011

Comparing two methods withone measurement on eachMorning

Bendix Carstensen

MethComp28 September 2011Tutorial, SISMEC, Ancona, Italy


Comparing measurement methods

General questions:

Are results systematically dierent?

Can one method safely be replaced by another?

What is the size of measurement errors?

Dierent centres use dierent methods ofmeasurement: How can we convert from onemethod to another?

How precise is the conversion?

Comparing two methods with one measurement on each 1/ 90

Two methods for measuring fat content inhuman milk:

1 2 3 4 5 6

12

34

56

Gerber

Trig

Therelationshiplooks like:

y1 = a+ by2


Two methods one measurement by each

How large is the dierence between a measurementwith method 1 and one with method 2 on a(randomly chosen) person?

Di = y2i y1i, D, s.d.(D)Limits of agreement:

D 2 s.d.(D)95% prediction interval for the dierence between ameasurement by method 1 and one by method 2.[1, 2]


Limits of agreement: Interpretation

If a new patient is measured once with each ofthe two methods, the dierence between thetwo values will with 95% probability be withinthe limits of agreement.

This is a prediction interval for a (future)dierence.

Requires a clinical input:Are the limits of agreement suciently narrowto make the use of either of the methodsclinically acceptable?

Is it relevant to test if the mean is 0?


Limits of agreement: Test?

Testing whether the dierence is 0 is a bad idea:

If the study is suciently small this will beaccepted even if the dierence is important.

If the study is suciently large this will berejected even if the dierence is clinicallyirrelevant.


Limits of agreement:

1 2 3 4 5 6

0.4

0.2

0.0

0.2

0.4

( Trig + Gerber ) / 2

Trig

G

erbe

r

0.17

0.00

0.17 Plotdierences(Di) versusaverages(Ai).


Model in Limits of agreement

Methods m = 1, . . . ,M , applied to i = 1, . . . , Iindividuals:

ymi = m + i + emiemi N (0, 2m) measurement error

Two-way analysis of variance model, withunequal variances in columns.

Dierent variances are not identiable withoutreplicate measurements for M = 2 because thevariances cannot be separated.

Models 7/ 90


Usually interpreted as the likely dierence betweentwo future measurements, one with each method:

y2 y1 = D = 2 1 1.96 s.d.(D)Normally we use 2 instead of 1.96.

Neither are formally correct if we take the modelseriously:

Use a t-quantile with I 1 d.f. Estimation s.d. of 2 1 is /

I.

So we should use t0.95

(I + 1)/I instead.This is 2.08 for I = 30 and less than 2 if I > 85.

Models 8/ 90


Limits of agreement can be converted to aprediction interval for y2 given y1, by solving for y2:

y2 y1 = 2 1 2 s.d.(D)which gives:

y2|1 = y2|y1 = 2 1 + y1 2 s.d.(D)

Models 9/ 90

Introduction to computingMorning

Bendix Carstensen



Structure of practicals

This tutorial is both theoretical and practical, i.e.the aim is to convey a basic understanding of theproblems in method comparison studies, but also toconvey practical skills in handling the statisticalanalysis.

R for data manipulation and graphics.

So we assume familiarity with R.

Occasionally BUGS for estimation in non-linearvariance component models.

BUGS is hidden inside an R-function.

Introduction to computing 10/ 90

How it works

Example data sets are included in the MethComppackage.

Functions in MethComp are based on a data framewith a particular structure; a Meth object:

meth method (factor)item item, person, individual, sample (factor)repl replicate (if present) (factor)

y the actual measurement (numerical)

Once converted to Meth, just use summary, plotetc.


How it looks:

> subset(ox,as.integer(item) subset(to.wide(ox),as.integermeth item repl y Note:

1 CO 1 1 78.0 Replicate measurements are t2 CO 1 2 76.4 item repl id CO pulse3 CO 1 3 77.2 1 1 1 1.1 78.0 714 CO 2 1 68.7 2 1 2 1.2 76.4 725 CO 2 2 67.6 3 1 3 1.3 77.2 736 CO 2 3 68.3 4 2 1 2.1 68.7 68184 pulse 1 1 71.0 5 2 2 2.2 67.6 67185 pulse 1 2 72.0 6 2 3 2.3 68.3 68186 pulse 1 3 73.0187 pulse 2 1 68.0188 pulse 2 2 67.0189 pulse 2 3 68.0


Getting your own data into R

Take a look in The R Primer by Claus Ekstrm,or:

If your data are not too large, the simplest is to edityour data in Excel or some other spreadsheet tolook like this:

item repl id CO pulse1 1 1.1 78.0 711 2 1.2 76.4 721 3 1.3 77.2 732 1 2.1 68.7 682 2 2.2 67.6 672 3 2.3 68.3 68

The rst line is variable names; the following linesare data.


Analysis options in this course

Scatter plots.

Bland-Altman plots ((y2 y1) vs. (y1 + y2)/2) Limits of Agreement (LoA).

Models with constant bias.

Models with linear bias.

Conversion formulae between methods (singlereplicates)

Transformation of measurements.

Plots of conversion equations.

Reporting of variance components.


Requirements

R for data manipulation and graphics. Keep a script of what you did:

Use the built-in editor in R the nerds can use ESS or you can download R-Studio.

You need the packages: MethComp R2WinBUGS coda BRugs Epi - Version 1.10 !!!


Functions in the MethComp package

5 broad categories of functions in MethComp:

Graphical exploring data.

Data manipulation reshaping and changing.

Simulation generating datasets or replacingvariables.

Analysis functions tting models to data.

Reporting functions displaying results fromanalyses.

Overview of these in the Practicals.


Does it work?

library(MethComp)library(help=MethComp) # Do you have version 1.10??data(ox)ox

Limits of agreement assumptions

The dierence between methods is constant

The variances of the methods (and hence ofthe dierence) is constant.

Check this by:

Regress dierences on averages.

Regress absolute residuals from this on theaverages.

Non-constant dierence 18/ 90

Glucose measurements

4 6 8 10 12 14

4

6

8

10

12

14

Venous plasma

Cap

illar

y bl

ood


Glucose measurements

5 6 7 8 9 10 11 12

4

2

0

2

4

( capil + plasma ) / 2

capi

l p

lasm

a

2.84

0.41

2.02


Regress dierence on average

Di = a+ bAi + ei, var(ei) = 2D

If b is dierent from 0, we could use this equation toderive LoA:

a+ bAi 2Dor convert to prediction as for LoA:

y2|1 = y1 + a+ bAi y1 + a+ by1 = a+ (1 + b)y1Exchanging methods would give:

y1|2 = a+ (1 b)y1instead of: y1|2 =

a1 + b

+1

1 + by1


Variable limits of agreement

5 6 7 8 9 10 11 12

4

2

0

2

4


capi

l p

lasm

a


Improving the regression of D on A

y2i y1i = a+ b(y1i + y2i)/2 + eiy2i(1 b/2) = a+ (1 + b/2)y1i + ei

y2i =a

1 b/2 +1 + b/2

1 b/2y1i +1

1 b/2ei

y1i =a

1 + b/2+

1 b/21 + b/2

y2i +1

1 + b/2ei

This is what comes out of the functionsDA.reg and BA.plot


Variable limits of agreement

5 6 7 8 9 10 11 12

4

2

0

2

4


capi

l p

lasm

a

capilplasma = 2.09 0.32 (capil+plasma)/2 (95% p.i.: +/2.16)

plasma = 2.49 + 1.38 capil (95% p.i.: +/2.57)capil = 1.80 + 0.73 plasma (95% p.i.: +/1.87)


Conversion equation with prediction limits

4 6 8 10 12 14

4

6

8

10

12

14

Venous plasma

Cap

illar

y bl

ood


Prediction intervals

Prediction s.e. for y1|2 is /(1 b/2) Prediction s.e. for y1|2 is /(1 + b/2) The slope of the prediction line is the ratio ofthe prediction s.e.s.

Hence prediction limits can be used both ways:


Conversion equation with prediction limits

4 6 8 10 12 14

4

6

8

10

12

14

Venous plasma

Cap

illar

y bl

ood


Why does this work?

The general model for the data is:

y1i = 1 + 1i + e1i, e1i N (0, 21)y2i = 2 + 2i + e2i, e2i N (0, 22)

Work out the prediction of y1 given anobservation of y2 in terms of these parameters.

Work out how dierences relate to averages interms of these parameters.

Then the prediction is as we just derived it.


Why is it wrong anyway?

Introducing linear bias, ymi = m + mi + emiputs measurements by dierent methods ondierent scales.Hence it has formally no meaning to form thedierences.

In the induced model for Di a+ bAi + ei,ei and AI are not independent.

But if is not too far from 1 it not a bigproblem, though.


Comparing two methods withreplicate measurementsMorning

Bendix Carstensen



Replicate measurementsFat data; exchangeable replicates:

item repl KL SL1 1 4.5 4.91 2 4.4 5.01 3 4.7 4.83 1 6.4 6.53 2 6.2 6.43 3 6.5 6.1

Oximetry data; linked replicates:

item repl CO pulse1 1 78.0 711 2 76.4 721 3 77.2 732 1 68.7 682 2 67.6 672 3 68.3 68

Linked or exchangeable replicates!Comparing two methods with replicate measurements 30/ 90

Extension of the model:exchangeable replicates

ymir = m + i + cmi + emirs.d.(cmi) = m matrix-eect

s.d.(emir) = m measurement error

Replicates within (m, i) are needed to separate and .

Even with replicates, the separate s are onlyestimable if M > 2.

Still assumes that the dierence betweenmethods is constant.

Assumes exchangeability of replicates.Comparing two methods with replicate measurements 31/ 90

Extension of the model:linked replicates

ymir = m + i + air + cmi + emirs.d.(air) = between replicates

s.d.(cmi) = m matrix-eect


Still assumes that the dierence betweenmethods is constant.

Replicates are linked between methods:air is common across methods, i.e. the rstreplicate on a person is made under similarconditions for all methods (i.e. at a specicday or the like).

Comparing two methods with replicate measurements 32/ 90

Replicate measurements

Three approaches to limits of agreement withreplicate measurements:

1. Take means over replicates within each methodby item stratum.

2. Replicates within item are taken as items.

3. Fit the correct variance components model anduse this as basis for the LoA.The model is tted using:> BA.est( data, linked=TRUE ).


Oximetry data

20 40 60 80

20

10

010

20

(CO+pulse)/2

CO

pul

se



The limits of agreement should still be fordierence between future single measurements.

Analysis based on the means of replicates istherefore wrong:

Model:

ymir = m + i + air + cmi + emir

var(y1jr y2jr) = 21 + 22 + 21 + 22 note that the term air air cancels becausewe are referring to the same replicate.


Wrong or almost right

In the model the correct limits of agreement wouldbe:

1 2 1.96 21 +

22 +

21 +

22

But if we use means of replicates to form thedierences we have:

di = y1i y2i = 1 2 +

r airR1i

r airR2i

+c1i c2i +

r e1irR1i

r e2irR2i


The terms with air are only relevant for linkedreplicates in which case R1i = R2i and therefore theterm vanishes. Thus:

var(di) = 21+

22+

21/R1i+

22/R2i <

21+

22+

21+

22

so the limits of agreement calculated based on themeans are much too narrow as prediction limits fordierences between future single measurements.


(Linked) replicates as items

If replicates are taken as items, then the calculateddierences are:

dir = y1ir y2ir = 1 2 + c1i c2i + e1ir e2irwhich has variance 21 +

22 +

21 +

22, and so gives

the correct limits of agreement. However, thedierences are not independent:

cov(dir, dis) = 21 +

22

Negligible if the residual variances are very largecompared to the interaction, variance likely to beonly slightly downwards biased.


Recommendations

Fit the correct model, and get the estimatesfrom that, e.g. by using BA.est.

If you must use over-simplied methods:

Use linked replicates as item.

If replicates are not linked; make a randomlinking.Note: If this give a substantially dierentpicture than using the original replicatenumbering as linking key, there might besomething shy about the data.

Further details, see [3].


Oximetry data

Linkedreplicates usedas items

Mean overreplicates asitems

Limits based onmodel dashed lineassumingexchangeablereplicates

20 40 60 80

20

10

010

20

(CO+pulse)/2

CO

pul

se


Repeatability andreproducibilityWednesday 9 February, morning

Bendix Carstensen



Accuracy of a measurement method

Repeatability:The accuracy of the method under exactlysimilar circumstances; i.e. the same lab, thesame technician, and the same day.(Repeatability conditions)

Reproducibility:The accuracy of the method under comparablecircumstances, i.e. the same machinery, thesame kit, but possibly dierent days orlaboratories or technicians.(Reproducibility conditions)

Repeatability and reproducibility 41/ 90

Quantication of accuracy

Upper limit of a 95% condence interval forthe dierence between two measurments.

Suppose the variance of the measurement is 2:

var(ymi1 ymi2) = 22

i.e the standard error is2, and a condnece

interval for the dierence:

0 1.962 = 0 2.772 2.8

This is called the reproducibility coecient orsimply the reproducibility. (The number 2.8 isused as a convenient approximation).


Quantication of accuracy

Where do we get the ?

Repeat measurements on the same item (oreven better) several items.

The conditions under which the repeat(replicate) measurements are taken determineswhether we are estimating repeatability orreproducibility.

In larger experiments we must consider theexchangeability of the replicates i.e. whichreplicates are done under (exactly) similarconditions and which are not.


Linear bias between methodsMorning

Bendix Carstensen



Extension with non-constant bias

ymir = m + mi + random eects

There is now a scaling between the methods.

Methods do not measure on the same scale therelative scaling is estimated, between method 1 and2 the scale is 2/1.

Consequence: Multiplication of all measurements onone method by a xed number does not changeresults of analysis:

The corresponding is multiplied by the samefactor as is the variance components for thismethod.

Linear bias between methods 44/ 90

Variance components

Two-way interactions:

ymir = m + m(i + air + cmi) + emir

The random eects cmi and emir have variancesspecic for each method.

But air does not depend on m must be scaled toeach of the methods by the corresponding m.

Implies that = s.d.(air) is irrelevant the scaleis arbitrary. The relevant quantities are m thebetween replicate variation within item as measuredon the mth scale.


0 20 40 60 80 100

020

4060

8010

0

y1

0 20 40 60 80 100

020

4060

8010

0

y2

0 20 40 60 80 100

020

4060

8010

0

y1

y2

ymir = m + i + air + cmi + emir


0 20 40 60 80 100

020

4060

8010

0

y1

0 20 40 60 80 100

020

4060

8010

0

y2

0 20 40 60 80 100

020

4060

8010

0

y1

y2



Estimation: Alternatingrandom eects regressionMorning

Bendix Carstensen



Alternating random eects regression

Carstensen [4] proposed a ridiculously complicatedapproach to t the model

ymir = m + mi + cmi + emir

based in the observation:

For xed the model is a linear mixed model.

For xed (, ) it is a regression through 0.

This has be improved in [5]

Alternating regressions 48/ 90

Alternating random eects regression

Now consider instead the correctly formulatedversion of the slightly more general model:


Here we observe

For xed mir = i + air + cmi the model is alinear model, with residual variances dierentbetween methods.

For xed (, ) scaled responses y are used:

ymir mm

= i + air + cmi + emir/m


Estimation algorithm


1. Start with mir = ymi2. Estimate (m, m).

3. Compute the scaled responses and t therandom eects model.

4. Use the estimated is, and BLUPs of air andcmi to update mir.

5. Check convergence in terms of identiableparameters.


The residual variances

The variance components are estimated in themodel for the scaled response.

The parameters (m, m) are not taken intoaccount in the calculation of the residualvariance.

Hence the residual variances must be correctedpost hoc.

This machinery is implemented in the functionAltReg in the MethComp package.


> AR.ox round(AR.ox,3)From

To Intercept: CO pulse Slope: CO pulse IxR sd. MxI sd. res.sd.CO 0.000 -2.159 1.000 1.063 3.521 2.978 2.055pulse 2.031 0.000 0.941 1.000 3.313 2.802 4.079


Your turn:

Start on the practical titled:

Oximetry: Linked replicates with non-constantbias


Converting between methodsAfternoon

Bendix Carstensen



Predicting method 2 from method 1

y10r = 1 + 1(0 + a0r + c10) + e10ry20r = 2 + 2(0 + a0r + c20) + e20r

y20r = 2 +

21

(y10r 1 e10r)+ 2(c10 + c20) + e20r

The random eects have expectation 0, so:

E(y20|y10) = y20 = 2 + 21

(y10 1)

Converting between methods 54/ 90

y20r = 2 +21

(y10r 1 e10r)+ 2(c10 + c20) + e20r

var(y20|y10) =(21

)2(21

21 +

21) + (

22

22 +

22)

The slope of the prediction line from method 1 tomethod 2 is 2/1.

The width of the prediction interval is:

2 2(

21

)2(21

21 +

21) + (

22

22 +

22)


If we do the prediction the other way round (y1|y2)we get the same relationship i.e. a line with theinverse slope, 1/2.

The width of the prediction interval in this directionis (by permutation of indices):

2 2

(2121 +

21) +

(12

)2(22

22 +

22)

= 2 2 12

(21

)2(21

21 +

21) + (

22

22 +

22)

i.e. if we draw the prediction limits as straight linesthey can be used both ways.


20 40 60 80 10020

40

60

80

100

CO

puls

e

pulse = 2.11 + 0.94 CO ( 6.00 )

CO = 2.25 + 1.06 pulse ( 6.39 )


What happened to the curvature?

0 5 10

20

10

010

2030

40

x

y

Usually the predictionlimits are curved:

y|x t0.975 1 + xx

In our prediction we have ignored the last term(xx), i.e. eectively assuming that there is noestimation error on 2|1 and 2|1.


Transformation of dataAfternoon

Bendix Carstensen



If variances are not constantA transformation might help:

> round( ftable( DA.reg(ox) ), 3 )alpha beta sd.pred beta=1 s.d.=K

From: To:CO CO 0.000 1.000 NA NA NA

pulse 1.864 0.943 5.979 0.142 0.000pulse CO -1.977 1.061 6.342 0.142 0.000

pulse 0.000 1.000 NA NA NA

> oxt round( ftable( DA.reg(oxt) ), 3 )alpha beta sd.pred beta=1 s.d.=K

From: To:CO CO 0.000 1.000 NA NA NA

pulse -0.034 0.900 0.306 0.009 0.246pulse CO 0.038 1.111 0.340 0.009 0.246

pulse 0.000 1.000 NA NA NA

Transformation of data 59/ 90

20 40 60 80 10020

40

60

80

100

CO

puls

e

pulse = 2.11 + 0.94 CO ( 6.00 )

CO = 2.25 + 1.06 pulse ( 6.39 )


Analysis on the transformed scale

> ARoxt ARoxt ARoxtNote: Response transformed by: log p/(100 - p)

Conversion between methods:alpha beta sd

To: From:CO CO 0.000 1.000 0.202

pulse 0.042 1.105 0.341pulse CO -0.038 0.905 0.309

pulse 0.000 1.000 0.271

Variance components (sd):s.d.

Method IxR MxI resCO 0.232 0.160 0.143pulse 0.210 0.145 0.191

This is an analysis for the transformed data.Transformation of data 62/ 90

Backtransformation for plotting

prpulse

0 20 40 60 80 100

40

20

0

20

40

( CO + pulse ) / 2

puls

e

CO

0 20 40 60 80 100

40

20

0

20

40

( CO + pulse ) / 2

puls

e

CO

pulse = 2.03 + 0.94 CO ( 6.01 )

CO = 2.16 + 1.06 pulse ( 6.38 )

0 20 40 60 80 100

40

20

0

20

40

( CO + pulse ) / 2

puls

e

CO


Implementation in BUGSAfternoon

Bendix Carstensen



Implementation in BUGS


Non-linear hierarchical model:Implement in BUGS.

The model is symmetrical in methods.

Mean is overparametrized.

Choose a prior (and hence posterior!) for thes with nite support.

Keeps the chains nicely in place.

This is the philosophy in the function MCmcmc.

Implementation in BUGS 67/ 90

Results from tting the model

The posterior distn of (m, m, i) is singular.

But the relevant translation quantities areidentiable:

2|1 = 2 12/12|1 = 2/1

So are the variance components.

Posterior medians used to devise predictionequations with limits.


The MethComp package for R

Implemented model:


Replicates required. R2WinBUGS, BRugs or JAGS is required. Dataframe with variablesmeth, item, repl and y (a Meth object)

The function MCmcmc writes a BUGS-program,initial values and data to les.

Runs BUGS and sucks results back in to R, andgives a nice overview of the conversionequations.


Example output: Oximetry

> summary( ox )#Replicates

Method 1 2 3 #Items #Obs: 354 Values: min med maxCO 1 4 56 61 177 22.2 78.6 93.5pulse 1 4 56 61 177 24.0 75.0 94.0

>> MCox

Simulation run of a model with- method by item and item by replicate interaction:- using 4 chains run for 2000 iterations(of which 1000 are burn-in),

- monitoring all values of the chain:- giving a posterior sample of 4000 observations.

model is syntactically correctdata loadedmodel compiledInitializing chain 1: initial values loaded but this or anotherInitializing chain 2: initial values loaded but this or anotherInitializing chain 3: initial values loaded but this or anotherInitializing chain 4: initial values loaded but this or anotherinitial values generated, model initializedSampling has been started ...1000 updates took 38 sdeviance setmonitor set for variable alphamonitor set for variable betamonitor set for variable sigma.mimonitor set for variable sigma.irmonitor set for variable sigma.resmonitor set for variable devianceImplementation in BUGS 71/ 90

> MCox

Conversion between methods:alpha beta sd

To: From:CO CO 0.000 1.000 1.740

pulse -9.342 1.159 5.328pulse CO 8.061 0.863 4.508

pulse 0.000 1.000 6.115


Variance components (sd):s.d.

Method IxR MxI resCO 3.878 3.122 1.230pulse 3.222 2.757 4.324

Variance components with 95 % cred.int.:method CO pulseqnt 50% 2.5% 97.5% 50% 2.5% 97.5%

SDIxR 3.878 3.053 4.533 3.222 2.426 3.930MxI 3.122 2.193 9.764 2.757 1.915 5.902res 1.230 0.143 2.639 4.324 3.709 5.019tot 5.220 4.507 10.645 6.135 5.457 7.849


Mean parameters with 95 % cred.int.:50% 2.5% 97.5% P(>0/1)

alpha[pulse.CO] 8.057 -2.457 29.884 0.969alpha[CO.pulse] -9.346 -49.949 2.476 0.031beta[pulse.CO] 0.863 0.604 0.997 0.024beta[CO.pulse] 1.159 1.003 1.657 0.976

Note that intercepts in conversion formulae are adjusted to getconversion formulae that represent the same line both ways,and hence the median interceps in the posterior do not agreeexactly with those given in the conversion formulae.


Inter-rater agreementAfternoon

Claus Thorn Ekstrm



Program

Example

Random rater vs. xed methods

Statistical modelling

Inter-rater agreement 75/ 90

Example: depression ratings

DoctorPatient A B C D E1 8 9 5 8 82 0 0 1 2 13 4 5 5 5 34 5 8 7 8 55 3 3 1 8 26 8 9 9 9 1

Doctor doctor! Am I depressed?

Getting second opinions ... and third ... and fourth...Inter-rater agreement 76/ 90

Example: depression ratings

Research question

How well will two doctors agree on the diagnosis?

In this example we use humans as measurementmethods or raters.

However, we are not interested in makingstatements about specic raters.


Fixed versus random eects

Denition: Factors can either be xed or random.

A factor is xed when the levels (e.g. raters)under study are the only levels of interest.

A factor is random when the levels under studyare a random sample from a larger populationof raters and the goal of the study is to make astatement regarding the larger population.

Raters can be dened as xed or random factors:

If the raters themselves are of interest (youwant to use them again) then use xed model.

If raters are randomly chosen of possible poolof raters (you do not have specic raters inmind) then use the random model.Inter-rater agreement 78/ 90

Fixed versus random eects


Modelling: exchangeable replicates

The model for xed methods is:

ymir = m + i + cmi + emirs.d.(cmi) = m matrix-eect




Assumes that the dierence between methodsis constant.

Assumes exchangeability of replicates.

If no replicates then disregard the cmis.Inter-rater agreement 80/ 90

Modelling: exchangeable replicates

The model for random methods/raters is:

ymir = bm + i + cmi + emirs.d.(bm) = variation among raters





Note: average dierence is 0! Assumes exchangeability of replicates.

If no replicates then disregard the cmis.Inter-rater agreement 81/ 90

Model for replicate measurements

Same approach as before: Fit the correct variancecomponents model and use this as the basis for LoA.

Extremely exible. Can even be used to analyze the situationwhere every rater not necessarily has scoredevery item.

Exchangeable replicates are not uncommon, e.g.,

Experts scoring/extracting information fromimages

Measurements taken on couples/twins.

Linked replicates do not make sense, when it isarbitrary which person is partner 1 or partner 2.



The limits of agreement / prediction interval for tworandom raters scoring a new future observation is

0 1.96

22Extra variation

+ 21 + 22 +

21 +

22

However, since we are considering the predictioninterval for two random raters we use the averagevariance components in the formula

0 1.96

2(2 + 2 + 2)

Note that the expected dierence is zero since wehave no xed order of the raters.


Example: Stress scoring of dogs

10 judges scoring stress indicators from 10 dogs.

> dogdata BA.est(dogdata, random.raters=TRUE, linked=FALSE)

Variance components (sd):IxR MxI M res

j1 0 18.145 14.11 20.948j10 0 8.122 14.11 12.736j2 0 0.009 14.11 11.350j3 0 0.004 14.11 9.524j4 0 0.004 14.11 9.614j5 0 8.924 14.11 12.588j6 0 18.534 14.11 21.135j7 0 0.023 14.11 11.991j8 0 0.004 14.11 9.384j9 0 0.003 14.11 9.789


Example: Stress scoring of dogs

10 judges scoring stress indicators from 10 dogs.

> res res$LoA

Mean Lower Upper SDRand. rater - rater 0 -61.02451 61.02451 30.51225


Linked replicates

For linked replicates, extend the model as before:

ymir = bm + i + air + cmi + emirs.d.(bm) = variation among raters

s.d.(air) = between replicates



The variation between replicates, , does not enterthe limits-of-agreement since the LoAs are for asingle new future observation (ie., the samereplicate from one item/individual for both raters).

0 1.96

2(2 + 2 + 2)Inter-rater agreement 86/ 90

Linked replicates

> dogdata BA.est(dogdata, random.raters=TRUE)

Variance components (sd):IxR MxI M res

j1 6.466 18.754 13.994 21.672j10 6.466 8.163 13.994 10.454j2 6.466 3.213 13.994 10.439j3 6.466 0.011 13.994 5.718j4 6.466 0.033 13.994 11.716j5 6.466 8.817 13.994 10.449j6 6.466 19.205 13.994 21.770j7 6.466 4.732 13.994 10.523j8 6.466 0.009 13.994 5.701j9 6.466 0.026 13.994 11.441


Linked replicates

> res2 res2$LoA

Mean Lower Upper SDRand. rater - rater 0 -60.47317 60.47317 30.23658


Random raters

Fit the correct variance component modelwhere variation among raters is considered arandom eect

Since each rater can have his/her individualvariance we need to average the individualvariance components

Extract the relevant variance components andcompute the limits-of-agreement


DG Altman and JM Bland.Measurement in medicine: The analysis of method comparison studies.The Statistician, 32:307317, 1983.

JM Bland and DG Altman.Statistical methods for assessing agreement between two methods of clinicalmeasurement.Lancet, i:307310, 1986.

B Carstensen, J Simpson, and LC Gurrin.Statistical models for assessing agreement in method comparison studies withreplicate measurements.International Journal of Biostatistics, 4(1):Article 16, 2008.

B Carstensen.Comparing and predicting between several methods of measurement.Biostatistics, 5(3):399413, Jul 2004.

B. Carstensen.Comparing Clinical Measurement Methods: A practical guide.Wiley, 2010.

3up

Documents