



Statistics Surveys Vol. 0 (0000) ISSN: 1935-7516 DOI: 10.1214/154957804100000000

An Introduction to Statistical Issues and Methods in Metrology for Physical Science and Engineering∗

Stephen Vardeman†, Michael Hamada‡, Tom Burr‡, Max Morris†, Joanne Wendelberger‡, J. Marcus Jobe§, Leslie Moore‡, and Huaiqing Wu¶

Abstract: Statistical science and physical metrology are inextricably intertwined. Measurement quality affects what can be learned from data collected and processed using statistical methods, and appropriate data collection and analysis quantifies the quality of physical measurements. Metrologists have long understood this and have often developed their own statistical methodologies and emphases. Some statisticians, notably those working in standards organizations such as the National Institute of Standards and Technology (NIST), at national laboratories, and in quality assurance and statistical organizations in manufacturing concerns have contributed to good measurement practice through the development of statistical methodologies appropriate to the use of ever-more-complicated measurement devices and to the quantification of measurement quality. In this regard, contributions such as the NIST e-Handbook [16], the AIAG MSA Manual [1], and the text of Gertsbakh [8] are meant to provide guidance in statistical practice to measurement practitioners, and Croarkin’s review [4] is a nice survey of metrology work done by statisticians at NIST.

Our purpose here is to provide a unified overview of the interplay between statistics and physical measurement for readers with roughly a first year graduate level background in statistics. We hope that most parts of this will also be accessible and useful to many scientists and engineers with somewhat less technical background in statistics, providing introduction to the best existing technology for the assessment and expression of measurement uncertainty. Our experience is that while important, this material is not commonly known to even quite competent statisticians. So while we don’t claim originality (and surely do not provide a comprehensive summary of all that has been done on the interface between statistics and metrology), we do hope to bring many important issues to the attention of a statistical community broader than that traditionally involved in work with metrologists.

We refer to a mix of frequentist and Bayesian methodologies in this exposition (the latter implemented in WinBUGS; see [14]). Our rationale is that while some simple situations in statistical metrology can be handled by well-established and completely standard frequentist methods, many others really call for analyses based on nonstandard statistical models. Development and exposition of frequentist methods for those problems would have to be handled on a very case-by-case basis, whereas the general Bayesian paradigm and highly flexible WinBUGS software tool allow de-emphasis of case-by-case technical considerations and concentration on matters of modeling and interpretation.

∗Development supported by NSF Grant DMS No. 0502347 EMSW21-RTG awarded to the Department of Statistics, Iowa State University. LA-UR 11-04736
†Statistics and IMSE Departments, Iowa State University, Ames, IA 50011, USA
‡Statistical Sciences CCS-6, Los Alamos National Laboratory, PO Box 1663, MS-F600, Los Alamos, NM 87545, USA
§Decision Sciences and Management Information Systems, Miami University, Oxford, OH 45056, USA
¶Department of Statistics, Iowa State University, Ames, IA 50011, USA

imsart-ss ver. 2011/05/20 file: StatSurvey*Stat*Metrology*Notes.tex date: September 21, 2011

Vardeman et al./Statistics and Physical Metrology 1

Contents

1 Introduction
1.1 Basic Concepts of Measurement/Metrology
1.2 Probability and Statistics and Measurement
1.3 Some Complications
2 Probability Modeling and Measurement
2.1 Use of Probability to Describe Empirical Variation and Uncertainty
2.2 High-Level Modeling of Measurements
2.2.1 A Single Measurand
2.2.2 Measurands From a Stable Process or Fixed “Population”
2.2.3 Multiple Measurement “Methods”
2.2.4 Quantization/Digitalization/Rounding and Other Interval Censoring
2.3 Low-Level Modeling of “Computed” Measurements
3 Simple Statistical Inference and Measurement (Type A Uncertainty Only)
3.1 Frequentist and Bayesian Inference for a Single Mean
3.2 Frequentist and Bayesian Inference for a Single Standard Deviation
3.3 Frequentist and Bayesian Inference from a Single Digital Sample
3.4 Frequentist and Bayesian “Two-Sample” Inference for a Difference in Means
3.5 Frequentist and Bayesian “Two-Sample” Inference for a Ratio of Standard Deviations
3.6 Section Summary Comments
4 Bayesian Computation of Type B Uncertainties
5 Some Intermediate Statistical Inference Methods and Measurement
5.1 Regression Analysis
5.1.1 Simple Linear Regression and Calibration
5.1.2 Regression Analysis and Correction of Measurements for the Effects of Extraneous Variables
5.2 Shewhart Charts for Monitoring Measurement Device Stability
5.3 Use of Two Samples to Separate Measurand Standard Deviation and Measurement Standard Deviation
5.4 One-Way Random Effects Analyses and Measurement
5.5 A Generalization of Standard One-Way Random Effects Analyses and Measurement
5.6 Two-Way Random Effects Analyses and Measurement
5.6.1 Two-Way Models Without Interactions and Gauge R&R Studies
5.6.2 Two-Way Models With Interactions and Gauge R&R Studies
5.6.3 Analyses of Gauge R&R Data
6 Concluding Remarks
6.1 Other Measurement Types
6.2 Measurement and Statistical Experimental Design and Analysis
6.3 The Necessity of Statistical Engagement
References

1. Introduction

1.1. Basic Concepts of Measurement/Metrology

A fundamental activity in all of science, engineering, and technology is measurement. Before one can learn from empirical observation or use scientific theories and empirical results to engineer and produce useful products, one must be able to measure all sorts of physical quantities. The ability to measure is an obvious prerequisite to data collection in any statistical study. On the other hand, statistical thinking and methods are essential to the rational quantification of the effectiveness of physical measurement.

This article is intended as an MS-level introduction for statisticians to metrology as an important field of application of and interaction with statistical methods. We hope that it will also provide graduate-level scientists and engineers (possessing a substantial background in probability and statistics) access to the best existing technology for the assessment and expression of measurement uncertainty.

We begin with some basic terminology, notation, and concerns of metrology.

Definition 1. A measurand is a physical quantity whose value, x, is of interest and for which some well-defined set of physical steps produces a measurement, y, a number intended to represent the measurand.

A measurand is often a feature or property of some particular physical object, such as “the diameter” of a particular turned steel shaft. But in scientific studies it can also be a more universal and abstract quantity, such as the speed of light in a vacuum or the mass of an electron. In the simplest cases, measurands are univariate, i.e., x ∈ R (and often x > 0), though as measurement technology advances, more and more complicated measurands and corresponding measurements can be contemplated, including vector or even functional x and y (such as mass spectra in chemical analyses). (Notice that per Definition 1, a measurand is a physical property, not a number. So absolutely precise wording would require that we not call x a measurand. But we will allow this slight abuse of language in this exposition, and typically call x a measurand rather than employ the clumsier language “value of a measurand.”)

The use of the different symbols x and y already suggests the fundamental fact that rarely (if ever) does one get to know a measurand exactly on the basis of real world measurement. Rather, the measurand is treated as conceptually unknown and unknowable and almost surely not equal to a corresponding measurement. There are various ways one might express the disparity between measurand and measurement. One is in terms of a simple arithmetic difference.

Definition 2. The difference e = y − x will be called a measurement error.

The development of effective measurement methods is the development of ways of somehow assuring that measurement errors will be “small.” This, for one thing, involves the invention of increasingly clever ways of using physical principles to produce indicators of measurands. (For example, [15] is full of discussions of various principles/methods that have been invented for measuring physical properties from voltage to viscosity to pH.) But once a relevant physical principle has been chosen, development of effective measurement also involves somehow identifying (whether through the use of logic alone or additionally through appropriate data collection and analysis) and subsequently mitigating the effects of important physical sources of measurement error. (For example, ISO standards for simple micrometers identify such sources as scale errors, zero-point errors, temperature induced errors, and anvil parallelism errors as relevant if one wants to produce an effective micrometer of the type routinely used in machine shops.)

In the development of any measurement method, several different ideas of “goodness” of measurement arise. These include the following.

Definition 3. A measurement or measuring method is said to be valid if it usefully or appropriately represents the measurand.

Definition 4. A measurement system is said to be precise if it produces small variation in repeated measurement of the same measurand.

Definition 5. A measurement system is said to be accurate (or sometimes unbiased) if on average (across a large number of measurements) it produces the true or correct value of a measurand.

While the colloquial meanings of the words “validity,” “precision,” and “accuracy” are perhaps not terribly different, it is essential that their technical meanings be kept straight and the words used correctly by statisticians and engineers.

Validity, while a qualitative concept, is the first concern when developing a measurement method. Without it, there is no point in proceeding to consider the quantitative matters of precision or accuracy. The issue is whether a method of measurement will faithfully portray the physical quantity of interest. When developing a new pH meter, one wants a device that will react to changes in acidity, not to changes in temperature of the solution being tested or to changes in the amount of light incident on the container holding the solution.

Precision of measurement has to do with getting similar values every time a measurement of a particular quantity is obtained. It is consistency or repeatability of measurement. A bathroom scale that can produce any number between 150 lb and 160 lb when one gets on it repeatedly is really not very useful. After establishing that a measurement system produces valid measurements, consistency of those measurements is needed. Precision is largely an intrinsic property of a measurement method or system. After all possible steps have been taken to mitigate important physical sources of measurement variation, there is not really any way to “adjust” for poor precision or to remedy it except 1) to overhaul or replace measurement technology or 2) to average multiple measurements. (Some implications of this second possibility will be seen later, when we consider simple statistical inference for means.)
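The second remedy just mentioned (averaging multiple measurements) can be sketched in a small simulation. This sketch is ours, not the authors'; the `measure` function and all numbers are hypothetical, and it assumes independent, unbiased Gaussian measurement errors:

```python
import random
import statistics

random.seed(0)

def measure(x, sd=1.0):
    """One hypothetical unbiased measurement of measurand x with precision sd."""
    return x + random.gauss(0.0, sd)

x = 155.0  # true measurand (e.g., a weight in lb)

# Compare single measurements with means of n = 16 repeated measurements.
singles = [measure(x) for _ in range(2000)]
means16 = [statistics.mean(measure(x) for _ in range(16)) for _ in range(2000)]

sd_single = statistics.stdev(singles)
sd_mean16 = statistics.stdev(means16)
# Averaging n measurements cuts the standard deviation by about sqrt(n),
# so sd_mean16 should be near sd_single / 4.
```

This is the familiar fact that the standard deviation of a mean of n independent measurements is the single-measurement standard deviation divided by the square root of n; it improves precision but does nothing about bias.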

But validity and precision together don’t tell the whole story regarding the usefulness of real-world measurements. The issue of accuracy remains. Does the measurement system or method produce the “right” value on average? In order to assess this, one needs to reference the system to an accepted standard of measurement. The task of comparing a measurement method or system to a standard one and, if necessary, working out conversions that will allow the method to produce “correct” (converted) values on average is called calibration. In the United States, the National Institute of Standards and Technology (NIST) is responsible for maintaining and disseminating consistent standards (items whose measurands are treated as “known” from measurement via the best available methods) for calibrating measurement equipment.

An analogy that is sometimes helpful in remembering the difference between accuracy and precision of measurement is that of target shooting. Accuracy in target shooting has to do with producing a pattern centered on the bull’s eye (the ideal). Precision has to do with producing a tight pattern (consistency).

In Definitions 4 and 5, the notions of “repeated measurement” and “average” are a bit nebulous. They refer somewhat vaguely to the spectrum of measurement values that one might obtain when using the method under discussion in pursuit of a single measurand. This must depend upon a careful operational definition of the “method” of measurement. The more loosely this is defined (for example by allowing for different “operators” or days of measurement, or batches of chemical reagent used in analysis, etc.) the larger the conceptual spectrum of outcomes that must be considered (and, for example, correspondingly the less precise in the sense of Definition 4 will be the method).

Related to Definition 5 is the following.

Definition 6. The bias of a measuring method for evaluating a measurand is the difference between the measurement produced on average and the value of the measurand.
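Definitions 4 through 6 can be made concrete with a quick simulation contrasting an accurate-but-imprecise method with a precise-but-biased one. This is a hypothetical sketch; the two methods and all numbers are invented for illustration:

```python
import random

random.seed(1)

def measure(x, bias, sd):
    """Hypothetical measurement of measurand x with the given bias and precision sd."""
    return x + bias + random.gauss(0.0, sd)

x = 10.0   # true value of the measurand
n = 5000   # number of repeated measurements

# An accurate but imprecise method, and a precise but biased one.
accurate_imprecise = [measure(x, bias=0.0, sd=1.0) for _ in range(n)]
precise_biased     = [measure(x, bias=0.5, sd=0.1) for _ in range(n)]

# Empirical bias in the sense of Definition 6: average measurement minus measurand.
est_bias_1 = sum(accurate_imprecise) / n - x   # should be near 0.0
est_bias_2 = sum(precise_biased) / n - x       # should be near 0.5
```

In the target-shooting analogy, the first method scatters widely around the bull's eye while the second produces a tight pattern centered off it.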

The ideal is, of course, that measurement bias is negligible. If a particular measurement method has a known and consistent bias across some set of measurands of interest, how to adjust a measurement for that bias is obvious. One may simply subtract the bias from an observed measurement (thereby producing a higher-level “method” with no (or little) bias). The possibility of such constant bias has its own terminology.

Definition 7. A measurement method or system is called linear if it has no bias or if its bias is constant in the measurand.

Notice that this language is a bit more specialized than the ordinary mathematical meaning of the word “linear.” What is required by Definition 7 is not simply that average y be a linear function of x, but that the slope of that linear relationship be 1. (Then, under Definition 7, the y-intercept is the bias of measurement.)
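A sketch of what Definition 7 asks of data: for a linear device, least squares applied to measurements of known standards should return a slope near 1, with the intercept estimating the (constant) bias. The device, the standards, and the noise level below are all hypothetical:

```python
import random

random.seed(2)

# A linear (slope-1) device in the sense of Definition 7: mean y = x + b.
b = 0.3                                 # constant bias
xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # known standard measurands
ys = [x + b + random.gauss(0.0, 0.01) for x in xs]  # measurements of them

# Ordinary least squares fit of y on x, computed by hand.
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar
# For a linear device, slope should be near 1 and intercept near b = 0.3.
```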

Because a measurement method or system must typically be used across time, it is important that its behavior does not change over time. When that is true, we might then employ the following language.

Definition 8. A measurement method or system is called stable if both its precision and its bias for any measurand are constant across time.

Some measurements are produced in a single fairly simple/obvious/direct step. Others necessarily involve computation from several more basic quantities. (For example, the density of a liquid can only be determined by measuring a weight/mass of a measured volume of the liquid.) In this latter case, it is common for some of the quantities involved in the calculation to have “uncertainties” attached to them whose bases may not be directly statistical, or if those uncertainties are based on data, the data are not available when a measurement is being produced. It is then not absolutely obvious (and a topic of later discussion in this exposition) how one might combine information from data in hand with such uncertainties to arrive at an appropriate quantification of uncertainty of measurement. It is useful in the discussion of these issues to have terminology for the nature of uncertainties associated with basic quantities used to compute a measurement.

Definition 9. A numerical uncertainty associated with an input to the computation of a measurement is of Type A if it is statistical/derived entirely from calculation based on observations (data) available to the person producing the measurement. If an uncertainty is not of Type A, it is of Type B.

The authors in [17] state that “Type B evaluation of standard uncertainty is usually based on scientific judgment using all the relevant information available, which may include

• previous measurement data,
• experience with, or general knowledge of, the behavior and property of relevant materials and instruments,
• manufacturer’s specifications,
• data provided in calibration and other reports, and
• uncertainties assigned to reference data taken from handbooks.”
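As a sketch of how a Type A and a Type B component might be combined for a computed measurement such as the density example above, the familiar first-order (delta-method) propagation for a quotient takes only a few lines. The numbers and uncertainty values are invented for illustration, and this simple root-sum-of-squares recipe is not the Bayesian treatment developed later in this exposition:

```python
import math

# Hypothetical computed measurement: density rho = m / V.
m, u_m = 25.20, 0.05   # mass (g), Type A standard uncertainty from repeated weighings
V, u_V = 10.00, 0.03   # volume (mL), Type B uncertainty from manufacturer's specs

rho = m / V

# First-order propagation for a quotient:
# (u_rho / rho)^2 = (u_m / m)^2 + (u_V / V)^2
u_rho = rho * math.sqrt((u_m / m) ** 2 + (u_V / V) ** 2)
# rho is 2.52 g/mL with a combined standard uncertainty a bit under 0.01 g/mL.
```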




1.2. Probability and Statistics and Measurement

The subjects of probability and statistics have intimate connections to the theory and practice of measurement. For completeness and the potential aid of readers who are not statisticians, let us say clearly what we mean by reference to these subjects.

Definition 10. Probability is the mathematical theory intended to describe chance/randomness/random variation.

The theory of probability provides a language and set of concepts and results directly relevant to describing the variation and less-than-perfect predictability of real world measurements.

Definition 11. Statistics is the study of how best to

1. collect data,
2. summarize or describe data, and
3. draw conclusions or inferences based on data,

all in a framework that recognizes the reality and omnipresence of variation in physical processes.

How sources of physical variation interact with a (statistical) data collection plan governs how measurement error is reflected in the resulting data set (and ultimately what of practical importance can be learned). On the other hand, statistical efforts are an essential part of understanding, quantifying, and improving the quality of measurement. Appropriate data collection and analysis provides ways of identifying (and ultimately reducing the impact of) sources of measurement error.

The subjects of probability and statistics together provide a framework for clear thinking about how sources of measurement variation and data collection structures combine to produce observable variation, and conversely how observable variation can be decomposed to quantify the importance of various sources of measurement error.

1.3. Some Complications

The simplest discussions of modeling, quantifying, and to the extent possible mitigating the effects of measurement error do not take account of some very real and important complications that arise in many scientific and engineering measurement contexts. We consider how some of these can be handled as this exposition proceeds. For the time being, we primarily simply note their potential presence and the kind of problems they present.

Some measurement methods are inherently destructive and cannot be used repeatedly for a given measurand. This makes evaluation of the quality of a measurement system problematic at best, and only fairly indirect methods of assessing precision, bias, and measurement uncertainty are available.




It is common for sensors used over time to produce measurements that exhibit “carry-over” effects from “the last” measurement produced by the sensor. Slightly differently put, it is common for measured changes in a physical state to lag actual changes in that state. This effect is known as hysteresis. For example, temperature sensors often read “high” in dynamic conditions of decreasing temperature and read “low” in dynamic conditions of increasing temperature, thereby exhibiting hysteresis.

In developing a measurement method, one wants to eliminate any important systematic effects on measurements of recognizable variables other than the measurand. Where some initial version of a measurement method does produce measurements depending in a predictable way upon a variable besides the measurand and that variable can itself be measured, the possibility exists of computing and applying an appropriate correction for the effects of that variable. For example, performance of a temperature sensor subject to hysteresis effects might be improved by adjusting raw read temperatures in light of temperatures read at immediately preceding time periods.
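The kind of correction just described might look as follows for a sensor with a simple first-order lag response. The response model, and the assumption that its per-step response fraction is known exactly, are hypothetical simplifications invented for this sketch:

```python
# Hypothetical first-order lag model for a slow temperature sensor:
# the reading moves only a fraction a of the way toward the true temperature
# each step, r_t = r_{t-1} + a * (T_t - r_{t-1}). If a is known, inverting
# this recovers T_t = r_{t-1} + (r_t - r_{t-1}) / a.

a = 0.5  # assumed known per-step response fraction of the sensor

def sensor_read(prev_reading, true_temp, a=a):
    """Lagged reading produced by the hypothetical sensor."""
    return prev_reading + a * (true_temp - prev_reading)

def corrected(prev_reading, reading, a=a):
    """Correction of a raw reading using the immediately preceding reading."""
    return prev_reading + (reading - prev_reading) / a

true_temps = [20.0, 25.0, 30.0, 28.0, 22.0]
readings = []
r = 20.0  # sensor starts in equilibrium at 20 degrees
for T in true_temps:
    r = sensor_read(r, T)
    readings.append(r)

# Raw readings lag the true temperatures; corrected values recover them.
recovered = [corrected(prev, curr)
             for prev, curr in zip([20.0] + readings[:-1], readings)]
```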

The simplest measurement contexts (in both conceptual and operational terms) are those where a method has precision that does not depend upon the measurand. “Constant variance” statistical models and methods are simpler and more widely studied than those that allow for non-constant variance. But where precision of measurement is measurand-dependent, it is essential to recognize that fact in modeling and statistical analysis. One of the great virtues of the Bayes statistical paradigm that we will employ in our discussion is its ability to incorporate features such as non-constant variance into an analysis in more or less routine fashion.

Real world measurements are always expressed only “to some number of digits.” This is especially obvious in a world where gauges are increasingly equipped with digital displays. But then, if one conceives of measurands as real (infinite number of decimal places) numbers, this means that there is so-called quantization error or digitalization error to be accounted for when modeling and using measurements. When one reports a measurement as 4.1 mm, what is typically meant is “between 4.05 mm and 4.15 mm.” So, strictly speaking, 4.1 is not a real number, and the practical meaning of ordinary arithmetic operations applied to such values is not completely clear. Sometimes this issue can be safely ignored, but in other circumstances it cannot. Ordinary statistical methods of the sort presented in essentially every elementary introduction to the subject are built on an implicit assumption that the numbers used are real numbers. A careful handling of the quantization issue therefore requires rethinking of what are appropriate statistical methods, and we will find the notion of interval censoring/rounding to be helpful in this regard. (We issue a warning at this early juncture that, despite the existence of a large literature on the subject, much of it in electrical engineering venues, we consider the standard treatment of quantization error, as independent of the measurement and uniformly distributed, to be thoroughly unsatisfactory.)
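A tiny simulation makes the warning concrete: when the underlying values are tightly clustered relative to the reporting grid, quantization error is strongly (negatively) related to the measurement itself, not independent of it and uniformly distributed. The grid size and distribution below are invented for illustration:

```python
import random

random.seed(3)

q = 0.1  # reporting grid, e.g. measurements reported to 0.1 mm

def report(y, q=q):
    """Digital display: round the underlying real-valued measurement to the grid."""
    return round(y / q) * q

# Underlying measurements tightly clustered relative to the grid width:
ys = [4.10 + random.gauss(0.0, 0.01) for _ in range(5000)]
errs = [report(y) - y for y in ys]

# Here nearly every value is reported as 4.1, so the quantization error is
# essentially -(y - 4.1): completely determined by y, not independent
# Uniform(-q/2, q/2). The sum below is therefore strongly negative.
corr_sign = sum((y - 4.10) * e for y, e in zip(ys, errs))
```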

Quantization might be viewed as one form of “coarsening” of observations in a way that potentially causes some loss of information available from measurement. (This is at least true if one adopts the conceptualization that corresponding to a real-valued measurand there is a conceptual perfect real-valued measurement that is converted to a digital response in the process of reporting.) There are other similar but more extreme possibilities in this direction that can arise in particular measurement contexts. There is often the possibility of encountering a measurand that is “off the scale” of a measuring method. It can be appropriate in such contexts to treat the corresponding observation as “left-censored” at the lower end of the measurement scale or “right-censored” at the upper end. A potentially more problematic circumstance arises in some chemical analyses, where an analyst may record a measured concentration only as below some “limit of detection.”¹ This phrase often has a meaning less concrete and more subjective than simply “off scale,” and reported limits of detection can vary substantially over time and are often not accompanied by good documentation of laboratory circumstances. But unless there is an understanding of the process by which a measurement comes to be reported as below a particular numerical value that is adequate to support the development of a probability model for the case, little can be done in the way of formal use of such observations in statistical analyses.
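When the reporting process is understood well enough to model, left-censored observations can enter a normal likelihood through the standard normal CDF. A minimal sketch, with invented data and an assumed common, well-documented detection limit, is:

```python
import math

def phi_cdf(z):
    """Standard normal CDF computed via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik(mu, sigma, exact, limits):
    """Normal log likelihood combining exactly reported values with
    observations reported only as below a detection limit (left-censored)."""
    ll = sum(-math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
             - 0.5 * ((y - mu) / sigma) ** 2 for y in exact)
    # Each censored report contributes P(y < L) = Phi((L - mu) / sigma).
    ll += sum(math.log(phi_cdf((L - mu) / sigma)) for L in limits)
    return ll

# Hypothetical concentrations: three measured values and two "below 0.5" reports.
exact = [0.9, 1.4, 1.1]
limits = [0.5, 0.5]

# A censoring-aware likelihood weighs both kinds of report; here the two
# below-limit observations make the lower mean more plausible.
ll_low  = loglik(0.6, 0.4, exact, limits)
ll_high = loglik(1.2, 0.4, exact, limits)
```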

Related to the notion of coarsening of observations is the possibility that final measurements are produced only on some ordinal scale. At an extreme, a test method might produce a pass/fail (0/1) measurement. (It is possible, but not necessary, that such a result comes from a check as to whether some real-number measurement is in an interval of interest.) Whether ordinal measurements are coarsened versions of conceptual real-number measurements or are somehow arrived at without reference to any underlying interval scale, the meaning of arithmetical operations applied to them is unclear; simple concepts such as bias as in Definition 6 are not directly relevant. The modeling of measurement error in this context requires something other than the most elementary probability models, and realistic statistical treatments of such measurements must be made in light of these less common models.

2. Probability Modeling and Measurement

2.1. Use of Probability to Describe Empirical Variation and Uncertainty

We will want to use the machinery of probability theory to describe various kinds of variation and uncertainty in both measurement and the collection of statistical data. Most often, we will build upon continuous univariate and joint distributions. This is realistic only for real-valued measurands and measurements and reflects the extreme convenience and mathematical tractability of such models.

1This use of the terminology “limit of detection” is distinct from a second one common in analytical chemistry contexts. If a critical limit is set so that a “blank” sample will rarely be measured above this limit, the phrase “lower limit of detection” is sometimes used to mean the smallest measurand that will typically produce a measurement above the critical limit. See pages 28–34 of [19] in this regard.

imsart-ss ver. 2011/05/20 file: StatSurvey*Stat*Metrology*Notes.tex date: September 21, 2011


We will use notation more or less standard in graduate level statistics courses, except possibly for the fact that we will typically not employ capital letters for random variables. The reader can determine from context whether a lower case letter is standing for a particular random variable or for a realized/possible value of that random variable.

Before going further, we need to make a philosophical digression and say that at least two fundamentally different kinds of things might get modeled with the same mathematical formalisms. That is, a probability density f (y) for a measurement y can be thought of as modeling observable empirical variation in measurement. This is a fairly concrete kind of modeling. A step removed from this might be a probability density f (x) for unobservable, but nevertheless empirical variation in a measurand x (perhaps across time or across different items upon which measurements are taken). Hence inference about f (x) must be slightly indirect and generally more difficult than inference about f (y) but is important in many engineering contexts. (We are here abusing notation in a completely standard way and using the same name, f , for different probability density functions (pdf’s).)

A different kind of application of probability is to the description of uncertainty. That is, suppose φ represents some set of basic variables that enter a formula for the calculation of a measurement y and one has no direct observation(s) on φ, but rather only some externally produced single value for it and uncertainty statements (of potentially rather unspecified origin) for its components. One might want to characterize what is known about φ with some (joint) probability density. In doing so, one is not really thinking of that distribution as representing potential empirical variation in φ, but rather the state of one’s ignorance (or knowledge) of it. With sufficient characterization, this uncertainty can be “propagated” through the measurement calculation to yield a resulting measure of uncertainty for y.

While few would question the appropriateness of using probability to model either empirical variation or (rather loosely defined) “uncertainty” about some inexactly known quantity, there could be legitimate objection to combining the two kinds of meaning in a single model. In the statistical world, controversy about simultaneously modeling empirical variation and “subjective” knowledge about model parameters was historically the basis of the “Bayesian-frequentist debate.” As time has passed and statistical models have increased in complexity, this debate has faded in intensity and, at least in practical terms, has been largely carried by the Bayesian side (that takes as legitimate the combining of different types of modeling in a single mathematical structure). We will see that in some cases, “subjective” distributions employed in Bayesian analyses are chosen to be relatively “uninformative” and ultimately produce inferences little different from frequentist ones.

In metrology, the question of whether Type A and Type B uncertainties should be treated simultaneously (effectively using a single probability model) has been answered in the affirmative by the essentially universally adopted Guide to the Expression of Uncertainty in Measurement originally produced by the International Organization for Standardization (see [10]). As Leon Gleser has said


in [9], the recommendations made in the so-called “GUM” (guide to uncertainty in measurement) “can be regarded as approximate solutions to certain frequentist and Bayesian inference problems.”

In this exposition we will from time to time and without further discussion use single probability models, part of which describe empirical variation and part of which describe uncertainty. We will further (as helpful) employ both frequentist and Bayesian statistical analyses, the latter especially because of their extreme flexibility and ability to routinely handle inferences for which no other methods are common in the existing statistical literature.

As a final note in this section, we point out what are the standard choices of models for Type B uncertainty. It seems to be fairly standard in metrology that where a univariate variable φ is announced as having value φ∗ and some “standard uncertainty” u, probability models entertained for φ are

• normal with mean φ∗ and standard deviation u, or

• uniform on (φ∗ − √3 u, φ∗ + √3 u) (and therefore with mean φ∗ and standard deviation u), or

• symmetrically triangular on (φ∗ − √6 u, φ∗ + √6 u) (and therefore again with mean φ∗ and standard deviation u).

2.2. High-Level Modeling of Measurements

2.2.1. A Single Measurand

Here we introduce probability notation that we will employ for describing at a high level the measurement of a single, isolated, conceptually real-number measurand x to produce a conceptually real-number measurement y and measurement error e = y − x. In the event that the distribution of measurement error is not dependent upon identifiable variables other than the measurand itself, it is natural to model y with some (conditional on x) probability density

f (y|x) (1)

which then leads to a (conditional) mean and variance for y

E [y|x] = µ (x) and Var [y|x] = Var [e|x] = σ2 (x) .

Ideally, a measurement method is unbiased (i.e., perfectly accurate) and

µ (x) = x, i.e., E [e|x] = 0 ,

but when that cannot be assumed, one might well call

δ (x) = µ (x)− x = E [e|x]

the measurement bias. Measurement method linearity requires that δ (x) be constant, i.e., δ (x) ≡ δ. Where measurement precision is constant, one can suppress the dependence of σ2 (x) upon x and write simply σ2.


Somewhat more complicated notation is appropriate when the distribution of measurement error is known to depend on some identifiable and observed vector of variables z. It is then natural to model y with some (conditional on both x and z) probability density

f (y|x, z) (2)

(we continue to abuse notation and use the same name, f , for a variety of different functions) which in this case leads to a (conditional) mean and variance

E [y|x, z] = µ (x, z) and Var [y|x, z] = Var [e|x, z] = σ2 (x, z) .

Here the measurement bias is

δ (x, z) = µ (x, z)− x = E [e|x, z]

and potentially depends upon both x and z. This latter possibility is clearly problematic unless one can either

1. hold z constant at some value z0 and reduce bias to (at worst) a function of x alone, or

2. so fully model the dependence of the mean measurement on both x and z that one can for a particular z = z∗ observed during the measurement process apply the corresponding function of x, δ (x, z∗), in the interpretation of a measurement.

Of course, precision that depends upon z is also unpleasant and would call for careful modeling and statistical analysis, and (especially if z itself is measured and thus not perfectly known) the kind of correction suggested in possibility 2 above will in practice not be exact.

Should one make multiple measurements of a single measurand, say y1, y2, . . . , yn, the simplest possible modeling of these is as independent random variables with joint pdf either

f (y|x) = ∏_{i=1}^{n} f (yi|x)

in the case the measurement error distribution does not depend upon identifiable variables besides the measurand and form (1) is used, or

f (y|x, (z1, z2, . . . , zn)) = ∏_{i=1}^{n} f (yi|x, zi)

in the case of form (2) where there is dependence of the form of the measurement error distribution upon z.

2.2.2. Measurands From a Stable Process or Fixed “Population”

In engineering contexts, it is fairly common to measure multiple items all comingfrom what one hopes is a physically stable process. In such a situation, there are


multiple measurands and one might well conceive of them as generated in an iid (independently identically distributed) fashion from some fixed distribution. For any one of the measurands, x, it is perhaps plausible to suppose that x has pdf (describing empirical variation)

f (x) (3)

and based on calculation using this distribution we will adopt the notation

Ex = µx and Varx = σ2x . (4)

In such engineering contexts, these process parameters (4) are often of as much interest as the individual measurands they produce.

Density (3) together with conditional density (1) produces a joint pdf for an (x, y) pair and marginal moments for the measurement

Ey = EE [y|x] = E (x+ δ (x)) = µx + Eδ (x) (5)

and

Vary = VarE [y|x] + EVar [y|x] = Varµ (x) + Eσ2 (x) . (6)

Now relationships (5) and (6) are, in general, not terribly inviting. Note however that in the case the measurement method is linear (δ (x) ≡ δ and µ (x) = E[y|x] = x + δ) and measurement precision is constant in x, displays (5) and (6) reduce to

Ey = µx + δ and Vary = σ2x + σ2 . (7)
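A brief Monte Carlo sketch (in Python, with purely hypothetical parameter values) illustrates the reduction (7): under linearity and constant precision, the marginal measurement mean is µx + δ and the marginal measurement variance is σ2x + σ2.

```python
import random
import statistics

random.seed(1)
mu_x, sigma_x = 5.0, 0.8  # hypothetical measurand-process mean and sd
delta, sigma = 0.3, 0.2   # hypothetical constant bias and constant measurement sd

ys = []
for _ in range(200_000):
    x = random.gauss(mu_x, sigma_x)            # draw a measurand from the stable process
    ys.append(random.gauss(x + delta, sigma))  # linear, constant-precision measurement of it

print(statistics.fmean(ys))      # near mu_x + delta
print(statistics.pvariance(ys))  # near sigma_x**2 + sigma**2
```

The simulated moments match display (7): measurement variation is the sum of process variation and measurement-error variation.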

Further observe that the marginal pdf for y following from densities (3) and (1) is

f (y) = ∫ f (y|x) f (x) dx . (8)

The variant of this development appropriate when the distribution of measurement error is known to depend on some identifiable and observed vector of variables z has

E [y|z] = E [E [y|x, z] |z] = E [x+ δ (x, z) |z] = µx + E [δ (x, z) |z]

and

Var [y|z] = Var [E [y|x, z] |z] + E [Var [y|x, z] |z]
          = Var [µ (x, z) |z] + E [σ2 (x, z) |z] . (9)

Various simplifying assumptions about bias and precision can lead to simpler versions of these expressions. The pdf of y|z following from (3) and (2) is

f (y|z) = ∫ f (y|x, z) f (x) dx . (10)

Where single measurements are made on iid measurands x1, x2, . . . , xn and the form of measurement error distribution does not depend on any identifiable


additional variables beyond the measurand, a joint (unconditional) density for the corresponding measurements y1, y2, . . . , yn is

f (y) = ∏_{i=1}^{n} f (yi) ,

for f (y) of the form (8). A corresponding joint density based on the form (10) for cases where the distribution of measurement error is known to depend on identifiable and observed vectors of variables z1, z2, . . . , zn is

f (y| (z1, z2, . . . , zn)) = ∏_{i=1}^{n} f (yi|zi) .

2.2.3. Multiple Measurement “Methods”

The notation z introduced in Section 2.2.1 can be used to describe several different kinds of “non-measurand effects” on measurement error distributions. In some contexts, z might stand for some properties of a measured item that do not directly affect the measurand associated with it, but do affect measurement. In others, z might stand for ambient or environmental conditions present when a measurement is made that are not related to the measurand. In other contexts, z might stand for some details of measurement protocol or equipment or personnel that again do not directly affect what is being measured, but do impact the measurement error distribution. In this last context, different values of z might be thought of as effectively producing different measurement devices or “methods,” and in fact assessing the importance of z to measurement (and mitigating any large potentially worrisome effects so as to make measurement more “consistent”) is an important activity.

One instance of this last possibility is that where z is a qualitative variable, taking one of, say, J different values j = 1, 2, . . . , J in a measurement study. In an engineering production context, each possible value of z might identify a different human operator that will use a particular gauge and protocol to measure some geometric characteristic of a metal part. In this context, variation in δ (x, z) as z changes is often called reproducibility variation. Where each δ (x, j) is assumed to be constant in x (each operator using the gauge produces a linear measurement “system”) with say δ (x, j) ≡ δj, it is reasonably common to model the δj as random and iid with variance σ2δ, a so-called reproducibility variance. This parameter becomes something that one wants to learn about from data collection and analysis. Where it is discovered to be relatively large, remedial action usually focuses on training operators to take measurements “the same way” through improved protocols, fixturing devices, etc.

In a similar way, it is common for measurement organizations to run so-called round-robin studies involving a number of different laboratories, all measuring what is intended to be a standard specimen (with a common measurand) in order to evaluate lab-to-lab variability in measurement. In the event that one assumes


that all of labs j = 1, 2, . . . , J measure in such a way that δ (x, j) ≡ δj, it is variability among, or differences in, these lab biases that is of first interest in a round-robin study.

2.2.4. Quantization/Digitalization/Rounding and Other Interval Censoring

For simplicity and without real loss of generality, suppose now that while measurement y is conceptually real-valued, it is only observed to the nearest integer. (If a measurement is observed to the kth decimal, then multiplication by 10^k produces an integer-valued observed measurement with units 10^−k times the original units.) In fact, let ⌊y⌋ stand for the integer-rounded version of y. The variable ⌊y⌋ is discrete and for integer i the model (1) for y implies that for measurand x

P [⌊y⌋ = i |x] = ∫_{i−.5}^{i+.5} f (y|x) dy (11)

and the model (2) implies that for measurand x and some identifiable and observed vector of variables z affecting the distribution of measurement error

P [⌊y⌋ = i |x, z] = ∫_{i−.5}^{i+.5} f (y|x, z) dy . (12)

(Limits of integration could be changed to i and i + 1 in situations where the rounding is “down” rather than to the nearest integer.)

Now, the bias and precision properties of the discrete variable ⌊y⌋ under the distribution specified by either (11) or (12) are not the same as those of the conceptually continuous variable y specified by (1) or (2). For example, in general for (11) and (1)

E [⌊y⌋ |x] ≠ µ (x) and Var [⌊y⌋ |x] ≠ σ2 (x) .

Sometimes the differences between the means and between the variances of the continuous and digital versions of y are small, but not always. The safest route to a rational analysis of quantized data is through the use of distributions specified by (11) or (12).
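To illustrate how large the differences can be, the following Python sketch evaluates the exact mean and variance of the integer-rounded measurement from the probabilities (11), for the hypothetical case of a normal y with mean 0.25 and standard deviation 0.2 (values chosen only to make the effect visible):

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rounded_moments(mu, sigma, lo=-50, hi=50):
    """Mean and variance of the integer-rounded measurement, via display (11),
    assuming y|x is normal with mean mu and sd sigma (hypothetical choices)."""
    m1 = m2 = 0.0
    for i in range(lo, hi + 1):
        p = Phi((i + 0.5 - mu) / sigma) - Phi((i - 0.5 - mu) / sigma)
        m1 += i * p
        m2 += i * i * p
    return m1, m2 - m1 * m1

mu, sigma = 0.25, 0.2
mean_q, var_q = rounded_moments(mu, sigma)
print(mean_q, var_q)  # mean well below mu; variance well above sigma**2
```

Here the rounded mean is near 0.11 (versus µ(x) = 0.25) and the rounded variance is more than double σ2(x), so naive use of the continuous model for the digital data would be badly misleading.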

We remark (again) that a quite popular but technically incorrect (and potentially quite misleading) method of recognizing the difference between y and ⌊y⌋ is through what electrical engineers call “the quantization noise model.” This treats the “quantization error”

q = ⌊y⌋ − y (13)

as a random variable uniformly distributed on (−.5, .5) and independent of y. This is simply an unrealistic description of q, which is a deterministic function of y (hardly independent of y!!) and rarely uniformly distributed. (Some implications of these issues are discussed in elementary terms in [18]. A more extensive treatment of the matter can be found in [3].)


It is worth noting that the notation (13) provides a conceptually appealing breakdown of a digital measurement as

⌊y⌋ = x + e + q .

With this notation, a digital measurement is thus the measurand plus a measurement error plus a digitalization error.

Expressions (11) and (12) have their natural generalizations to other forms of interval censoring/coarsening of a conceptually real-valued measurement y. If for a set of intervals {Ii} (finite or infinite in number and/or extent), when y falls into Ii one does not learn the value of y, but only that y ∈ Ii, appropriate probability modeling of what is observed replaces conditional probability density f (y|x) on R with conditional density f (y|x) on R \ ∪{Ii} and the set of conditional probabilities

P [y ∈ Ii |x] = ∫_{Ii} f (y|x) dy or P [y ∈ Ii |x, z] = ∫_{Ii} f (y|x, z) dy .

For example, if a measurement device can read out only measurements between a and b, values below a might be called “left-censored” and those to the right of b might be called “right-censored.” Sensible probability modeling of measurement will be through a probability density, its values on (a, b), and its integrals from −∞ to a and from b to ∞.
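For a concrete sketch of those censoring probabilities (all numbers hypothetical), suppose y|x is normal with mean x and standard deviation 1 and a device reads out only on (a, b) = (2, 8); the censoring probabilities are simply tail integrals of the density:

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

a, b = 2.0, 8.0      # hypothetical readable range of the device
x, sigma = 7.5, 1.0  # hypothetical measurand and measurement sd

p_left = Phi((a - x) / sigma)         # P[y <= a | x]: observation left-censored at a
p_right = 1.0 - Phi((b - x) / sigma)  # P[y >= b | x]: observation right-censored at b
print(p_left, p_right)
```

With the measurand this close to the upper end of the scale, roughly 31% of measurements would be right-censored at b, while left-censoring is negligible.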

2.3. Low-Level Modeling of “Computed” Measurements

Consider now how probability can be used in the more detailed/low-level modeling of measurements derived as functions of several more basic quantities. In fact suppose that a univariate measurement, y, is to be derived through the use of a measurement equation

y = m (θ, φ) , (14)

where m is a known function and empirical/statistical/measured values of the entries of θ = (θ1, θ2, . . . , θK) (say, θ̂ = (θ̂1, θ̂2, . . . , θ̂K)) are combined with externally-provided values of the entries of φ = (φ1, φ2, . . . , φL) to produce

y = m (θ̂, φ) . (15)

For illustration, consider determining the cross-sectional area A, of a length l, of copper wire by measuring its electrical resistance, R, at 20 ◦C. Physical theory says that the resistivity ρ of a material specimen with constant cross-section at a fixed temperature is

ρ = RA/l


so that

A = ρl/R . (16)

Using the value of ρ for copper available from a handbook and measuring values l and R, it’s clear that one can obtain a measured value of A through the use of the measurement equation (16). In this context, θ = (l, R) and φ = ρ.

Then, we further suppose that Type B uncertainties are provided for the elements of φ, say via the elements of u = (u1, u2, . . . , uL). Finally, suppose that Type A standard errors are available for the elements of θ̂, say via the elements of s = (s1, s2, . . . , sK). The GUM prescribes that a “standard uncertainty” associated with y in display (15) be computed as

u = √( ∑_{i=1}^{K} (∂y/∂θi |_(θ̂,φ))² si² + ∑_{i=1}^{L} (∂y/∂φi |_(θ̂,φ))² ui² ) . (17)
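A small Python sketch applies the propagation formula (17) to the resistivity example (16). The measured values and uncertainties below are entirely hypothetical, apart from the handbook resistivity of copper; the partial derivatives are obtained by central differences rather than analytically:

```python
import math

def area(l, R, rho):
    """Measurement equation (16): A = rho * l / R."""
    return rho * l / R

# hypothetical measured values with Type A standard errors (s) for (l, R),
# and a handbook resistivity with a hypothetical Type B standard uncertainty
l, s_l = 10.0, 0.01             # length (m)
R, s_R = 0.05, 0.0005           # resistance (ohm)
rho, u_rho = 1.68e-8, 0.02e-8   # resistivity of copper at 20 C (ohm m)

def partial(f, args, i, h=1e-6):
    """Central-difference estimate of the i-th partial derivative of f at args."""
    lo, hi = list(args), list(args)
    step = h * args[i]
    lo[i] -= step
    hi[i] += step
    return (f(*hi) - f(*lo)) / (2.0 * step)

args = (l, R, rho)
u = math.sqrt(sum(partial(area, args, i) ** 2 * s ** 2
                  for i, s in enumerate((s_l, s_R, u_rho))))
print(area(*args), u)  # measured area and its standard uncertainty per (17)
```

For these inputs the terms from R and ρ dominate the combined standard uncertainty, which matches the analytic form √((ρ/R)² s_l² + (ρl/R²)² s_R² + (l/R)² u_ρ²).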

Formula (17) has the form of a “1st order delta-method approximation” to the standard deviation of a random quantity m (θ∗, φ∗) defined in terms of the function m and independent random vectors θ∗ and φ∗ whose mean vectors are respectively θ̂ and φ and whose covariance matrices are respectively diag(s1², s2², . . . , sK²) and diag(u1², u2², . . . , uL²). Note that the standard uncertainty (17) (involving as it does the Type B ui’s) is a “Type B” quantity.

We have already said that the GUM has answered in the affirmative the implicit philosophical question of whether Type A and Type B uncertainties can legitimately be combined into a single quantitative assessment of uncertainty in a computed measurement, and (17) is one manifestation of this fact. It remains to produce a coherent statistical rationale for something similar to (15) and (17). The Bayesian statistical paradigm provides one such rationale. In this regard we offer the following.

Let w stand for data collected in the process of producing the measurement y, and f (w|θ) specify a probability model for w that depends upon θ as a (vector) parameter. Suppose that one then provides a “prior” (joint) probability distribution for θ and φ that is meant to summarize one’s state of knowledge or ignorance about these vectors before collecting the data. We will assume that this distribution is specified by some joint “density”

g (θ, φ) .

(This could be a pdf, a probability mass function (pmf), or some hybrid, specifying a partially continuous and partially discrete joint distribution.) The product

f (w|θ) g (θ, φ) (18)

treated as a function of θ and φ (for the observed data w plugged in) is then proportional to a posterior (conditional on the data) joint density for θ and φ. This specifies what is (at least from a Bayesian perspective) a legitimate probability distribution for θ and φ. This posterior distribution in turn immediately


leads (at least in theory) through equation (14) to a posterior probability distribution for m (θ, φ). Then, under suitable circumstances2, forms (15) and (17) are potential approximations to the posterior mean and standard deviation of m (θ, φ). Of course, a far more direct route of analysis is to simply take the posterior mean (or median) as the measurement and the posterior standard deviation as the standard uncertainty.

As this exposition progresses we will provide some details as to how a full Bayesian analysis can be carried out for this kind of circumstance using the WinBUGS software to do Markov chain Monte Carlo simulation from the posterior distribution. Our strong preference for producing Type B uncertainties in this kind of situation is to use the full Bayesian methodology and posterior standard deviations in preference to the more ad hoc quantity (17).
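Without any MCMC machinery, the logic of display (18) can be sketched directly for a toy one-parameter case: evaluate likelihood × prior on a fine grid, normalize, and read off the posterior mean and standard deviation. Everything below (model, data, and prior) is hypothetical and chosen only so the grid answer can be checked against the conjugate normal-normal closed form:

```python
import math

# toy case: data w_i ~ N(theta, 1) with a N(0, 10^2) prior on theta (all hypothetical)
w = [4.2, 3.8, 4.0, 4.4]

def log_lik(theta):
    return -0.5 * sum((wi - theta) ** 2 for wi in w)

def log_prior(theta):
    return -0.5 * (theta / 10.0) ** 2

# evaluate likelihood x prior, as in display (18), on a grid and normalize
grid = [i / 1000.0 for i in range(-10_000, 10_001)]  # theta from -10 to 10
log_post = [log_lik(t) + log_prior(t) for t in grid]
shift = max(log_post)  # subtract the max before exponentiating, for stability
wts = [math.exp(lp - shift) for lp in log_post]
Z = sum(wts)
post_mean = sum(t * wt for t, wt in zip(grid, wts)) / Z
post_var = sum((t - post_mean) ** 2 * wt for t, wt in zip(grid, wts)) / Z
print(post_mean, math.sqrt(post_var))
```

This grid approach does not scale beyond a few parameters, which is exactly why MCMC tools such as WinBUGS are used for realistic versions of the problem; but for one parameter it reproduces the conjugate posterior essentially exactly.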

3. Simple Statistical Inference and Measurement (Type A Uncertainty Only)

While quite sophisticated statistical methods have their place in some measurement applications, many important issues in statistical inference and measurement can be illustrated using very simple methods. So rather than begin discussion with complicated statistical analyses, we begin by considering how basic statistical inference interacts with basic measurement. Some of the discussion of Sections 3.1, 3.2, 3.4, and 3.5 below is a more general and technical version of material that appeared in [21].

3.1. Frequentist and Bayesian Inference for a Single Mean

Probably the most basic method of statistical inference is the (frequentist) t confidence interval for a mean, that computed from observations w1, w2, . . . , wn with sample mean w̄ and sample standard deviation sw has endpoints

w̄ ± t sw/√n (19)

(where t is a small upper percentage point of the tn−1 distribution). These limits (19) are intended to bracket the mean of the data-generating mechanism that produces the wi. The probability model assumption supporting formula (19) is that the wi are iid N(µw, σ2w), and it is the parameter µw that is under discussion. Careful thinking about measurement and probability modeling reveals a number of possible real-world meanings for µw, depending upon the nature of the data collection plan and what can be assumed about the measuring method. Among the possibilities are at least

2For example, if m is “not too non-linear,” the ui are “not too big,” the prior is one of independence of θ and φ, the prior for φ makes its components independent with means the externally prescribed values and variances ui², the prior for θ is “flat,” and estimated correlations between estimates of the elements of θ based on the likelihood f (w|θ) are small, then (15) and (17) will typically be decent approximations for respectively the posterior mean and standard deviation of m (θ, φ).


1. a measurand plus measurement bias, if the wi are measurements yi made repeatedly under fixed conditions for a single measurand,

2. an average measurand plus measurement bias, if the wi are measurements yi for n different measurands that are themselves drawn from a stable process, under the assumption of measurement method linearity (constant bias) and constant precision,

3. a measurand plus average bias, if the wi are measurements yi made for a single measurand, but with randomly varying and uncorrected-for vectors of variables zi that potentially affect bias, under the assumption of constant measurement precision, and

4. a difference in measurement method biases (for two linear methods with potentially different but constant associated measurement precisions) if the wi are differences di = y1i − y2i between measurements made using methods 1 and 2 for n possibly different measurands.

Let us elaborate somewhat on contexts 1–4 based on the basics of Section 2.2,beginning with the simplest situation of context 1.

Where repeat measurements y1, y2, . . . , yn for measurand x are made by the same method under fixed physical conditions, the development of Section 2.2.1 is relevant. An iid model with marginal mean x + δ (x) (or x + δ (x, z0) for fixed z0) and marginal variance σ2 (x) (or σ2 (x, z0) for fixed z0) is plausible, and upon either making a normal distribution assumption or appealing to the widely known robustness properties of the t interval, limits (19) applied to the n measurements yi serve to estimate x + δ (x) (or x + δ (x, z0)).

Note that this first application of limits (19) provides a simple method of calibration. If measurand x is “known” because it corresponds to some sort of standard specimen, limits

ȳ ± t sy/√n

for x + δ (x) correspond immediately to limits

(ȳ − x) ± t sy/√n

for the bias δ (x).

Consider then context 2. Where single measurements y1, y2, . . . , yn for measurands x1, x2, . . . , xn (modeled as generated in an iid fashion from a distribution with Ex = µx and Varx = σ2x) are made by a linear device with constant precision, the development of Section 2.2.2 is relevant. If measurement errors are modeled as iid with mean δ (the fixed bias) and variance σ2, display (7) gives marginal mean and variance for the (iid) measurements, Ey = µx + δ and Vary = σ2x + σ2. Then again appealing to either normality or robustness, limits (19) applied to the n measurements yi serve to estimate µx + δ.

Next, consider context 3. As motivation for this case, one might take a situation where n operators each provide a single measurement of some geometric feature of a fixed metal part using a single gauge. It is probably too much to assume that these operators all use the gauge exactly the same way, and so


an operator-dependent bias, say δ (x, zi), might be associated with operator i. Denote this bias as δi. Arguing as in the development of Section 2.2.3, one might suppose that measurement precision is constant and the δi are iid with Eδ = µδ and Varδ = σ2δ and independent of measurement errors. That in turn gives Ey = x + µδ and Vary = σ2δ + σ2, and limits (19) applied to n iid measurements yi serve to estimate x + µδ.

Finally consider context 4. If a single measurand produces y1 using Method 1 and y2 using Method 2, under an independence assumption for the pair (and providing the error distributions for both methods are unaffected by identifiable variables besides the measurand), the modeling of Section 2.2.1 prescribes that the difference be treated as having E(y1 − y2) = (x + δ1 (x)) − (x + δ2 (x)) = δ1 (x) − δ2 (x) and Var(y1 − y2) = σ21 (x) + σ22 (x). Then, if the two methods are linear with constant precisions, it is plausible to model the differences di = y1i − y2i as iid with mean δ1 − δ2 and variance σ21 + σ22, and limits (19) applied to the n differences serve to estimate δ1 − δ2 and thereby provide a comparison of the biases of the two measurement methods.

It is an important (but largely ignored) point that if one applies limits (19) to iid digital/quantized measurements ⌊y1⌋, ⌊y2⌋, . . . , ⌊yn⌋, one gets inferences for the mean of the discrete distribution of ⌊y⌋, and NOT for the mean of the conceptually continuous distribution of y. As we remarked in Section 2.2.4, these means may or may not be approximately the same. In Section 3.3 we will discuss methods of inference different from limits (19) that can always be used to employ quantized measurements in inference for the mean of the continuous distribution.

The basic case of inference for a normal mean represented by the limits (19) offers a simple context in which to make a first reasonably concrete illustration of Bayesian inference. That is, an alternative to the frequentist analysis leading to the t confidence limits (19) is this. Suppose that w1, w2, . . . , wn are modeled as iid N(µw, σ2w) and let

f (w|µw, σ2w)

be the corresponding joint pdf for w. If one subsequently adopts an “improper prior” (improper because it specifies a “distribution” with infinite mass, not a probability distribution) for (µw, σ2w) that is a product of a “Uniform(−∞,∞)” distribution for µw and a Uniform(−∞,∞) distribution for ln (σw), the corresponding “posterior distribution” (distribution conditional on data w) for µw produces probability intervals equivalent to the t intervals. That is, setting γw = ln (σw) one might define a “distribution” for (µw, γw) using a density on R2

g (µw, γw) ≡ 1

and a “joint density” for w and (µw, γw)

f (w|µw, exp (2γw)) · g (µw, γw) = f (w|µw, exp (2γw)) . (20)

Provided n ≥ 2, when treated as a function of (µw, γw) for observed valuesof wi plugged in, display (20) specifies a legitimate (conditional) probability

imsart-ss ver. 2011/05/20 file: StatSurvey*Stat*Metrology*Notes.tex date: September 21, 2011


distribution for (µw, γw). (This is despite the fact that it specifies infinite total mass for the joint distribution of w and (µw, γw).) The corresponding marginal posterior distribution for µw is that of w̄ + T sw/√n for T ∼ tn−1, which leads to posterior probability statements for µw operationally equivalent to frequentist confidence statements.

We will, in this exposition, use the WinBUGS software as a vehicle for enabling Bayesian computation and where appropriate provide some WinBUGS code. Here is some code for implementing the Bayesian analysis just outlined, demonstrated for an n = 5 case.

WinBUGS Code Set 1

#here is the model statement
model {
muw~dflat()
logsigmaw~dflat()
sigmaw<-exp(logsigmaw)
tauw<-exp(-2*logsigmaw)
for (i in 1:N) {
W[i]~dnorm(muw,tauw)
}
#WinBUGS parameterizes normal distributions with the
#second parameter inverse variances, not variances
}
#here are some hypothetical data
list(N=5,W=c(4,3,3,2,3))
#here is a possible initialization
list(muw=7,logsigmaw=2)

Notice that in the code above, w1 = 4, w2 = 3, w3 = 3, w4 = 2, w5 = 3 are implicitly treated as real numbers. (See Section 3.3 for an analysis that treats them as digital.) Computing with these values, one obtains w̄ = 3.0 and sw = √(1/2). So using t4 distribution percentage points, 95% confidence limits (19) for µw are

3.0 ± 2.776 (0.7071/√5),

i.e., 2.12 and 3.88.
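These values are easy to reproduce numerically (a quick check using scipy for the t percentage point):

```python
import math
from statistics import mean, stdev
from scipy import stats

w = [4, 3, 3, 2, 3]
n = len(w)
wbar, sw = mean(w), stdev(w)       # 3.0 and sqrt(1/2), about 0.7071
t = stats.t.ppf(0.975, df=n - 1)   # about 2.776
half = t * sw / math.sqrt(n)
lo, hi = wbar - half, wbar + half  # about 2.12 and 3.88
```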

These are also 95% posterior probability limits for µw based on structure (20). In this simple Bayesian analysis, the nature of the posterior distribution can be obtained either exactly from examination of the form (20), or in approximate terms through the use of WinBUGS simulations. As we progress to more complicated models and analyses, we will typically need to rely solely on the simulations.

3.2. Frequentist and Bayesian Inference for a Single Standard Deviation

Analogous to the t limits (19) are χ² confidence limits for a single standard deviation that, computed from observations w1, w2, . . . , wn with sample standard deviation sw, are

sw √((n − 1)/χ²upper) and/or sw √((n − 1)/χ²lower)   (21)

(for χ²upper and χ²lower respectively small upper and lower percentage points of the χ²n−1 distribution). These limits (21) are intended to bracket the standard deviation of the data-generating mechanism that produces the wi and are based on the model assumption that the wi are iid N(µw, σw²). Of course, it is the parameter σw that is thus under consideration, and there are a number of possible real-world meanings for σw, depending upon the nature of the data collection plan and what can be assumed about the measuring method. Among these are at least

1. a measurement method standard deviation, if the wi are measurements yi made repeatedly under fixed conditions for a single measurand,

2. a combination of measurand and measurement method standard deviations, if the wi are measurements yi for n different measurands that are themselves independently drawn from a stable process, under the assumption of measurement method linearity and constant precision, and

3. a combination of measurement method and bias standard deviations, if the wi are measurements yi made for a single measurand, but with randomly varying and uncorrected-for vectors of variables zi that potentially affect bias, under the assumption of constant measurement precision.
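The χ² limits (21) can be sketched as a small function (scipy supplies the χ² percentage points; the example call reuses the n = 5, sw = √(1/2) sample of Section 3.1):

```python
import math
from scipy import stats

def chi2_sd_limits(sw, n, conf=0.95):
    """Two-sided chi-square confidence limits (21) for a normal sd."""
    alpha = 1 - conf
    upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # chi-square "upper" point
    lower = stats.chi2.ppf(alpha / 2, df=n - 1)      # chi-square "lower" point
    return (sw * math.sqrt((n - 1) / upper),
            sw * math.sqrt((n - 1) / lower))

lo, hi = chi2_sd_limits(sw=math.sqrt(0.5), n=5)
```

Note how wide the interval is for n = 5: small samples say little about a standard deviation.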

Let us elaborate somewhat on contexts 1-3.

First, as for context 1 in Section 3.1, where repeat measurements y1, y2, . . . , yn for measurand x are made by the same method under fixed physical conditions, the development of Section 2.2.1 is relevant. An iid model with fixed marginal variance σ²(x) (or σ²(x, z0) for fixed z0) is plausible. Upon making a normal distribution assumption, the limits (21) then serve to estimate σ(x) (or σ(x, z0) for fixed z0) quantifying measurement precision.

For context 2 (as for the corresponding context in Section 3.1), where single measurements y1, y2, . . . , yn for measurands x1, x2, . . . , xn (modeled as generated in an iid fashion from a distribution with Var x = σx²) are made by a linear device with constant precision, the development of Section 2.2.2 is relevant. If measurement errors are modeled as iid with mean δ (the fixed bias) and variance σ², display (7) gives the marginal variance for the (iid) measurements, Var y = σx² + σ². Then under a normal distribution assumption, limits (21) applied to the n measurements yi serve to estimate √(σx² + σ²).


Notice that in the event that one or the other of σx and σ can be taken as “known,” confidence limits for √(σx² + σ²) can be algebraically manipulated to produce confidence limits for the other standard deviation. If, for example, one treats measurement precision (i.e., σ²) as known and constant, limits (21) computed from single measurements y1, y2, . . . , yn for iid measurands correspond to limits

√max(0, s²((n − 1)/χ²upper) − σ²) and/or √max(0, s²((n − 1)/χ²lower) − σ²)   (22)

for σx quantifying the variability of the measurands. In an engineering production context, these are limits for the “process standard deviation” uninflated by measurement noise.
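Limits (22) admit the same kind of sketch (the example values s = 1.0, n = 21, σ = 0.3 are hypothetical, chosen only to exercise the formula):

```python
import math
from scipy import stats

def measurand_sd_limits(s, n, sigma, conf=0.95):
    """Limits (22): bracket sigma_x when the measurement sd sigma is known,
    subtracting the known measurement variance on the variance scale."""
    alpha = 1 - conf
    upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    lower = stats.chi2.ppf(alpha / 2, df=n - 1)
    lo = math.sqrt(max(0.0, s * s * (n - 1) / upper - sigma * sigma))
    hi = math.sqrt(max(0.0, s * s * (n - 1) / lower - sigma * sigma))
    return lo, hi

lo, hi = measurand_sd_limits(s=1.0, n=21, sigma=0.3)
```

The max(0, ·) guards against a negative difference when the sample variance is not much larger than the known measurement variance.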

Finally consider context 3. As for the corresponding context in Section 3.1, take as motivation a situation where n operators each provide a single measurement of some geometric feature of a fixed metal part using a single gauge, and an operator-dependent bias, say δ(x, zi), is associated with operator i. Abbreviating this bias to δi and again arguing as in the development of Section 2.2.3, one might suppose that measurement precision is constant and the δi are iid with Var δ = σδ². That (as before) gives Var y = σδ² + σ², and limits (21) applied to n iid measurements yi serve to estimate √(σδ² + σ²).

In industrial instances of this context (of multiple operators making single measurements on a fixed item) the variances σ² and σδ² are often called respectively repeatability and reproducibility variances. The first is typically thought of as intrinsic to the gauge or measurement method, and the second as chargeable to consistent (and undesirable and potentially reducible) differences in how operators use the method or gauge. As suggested regarding context 2, where one or the other of σ and σδ can be treated as known from previous experience, algebraic manipulation of confidence limits for √(σδ² + σ²) leads (in a way parallel to display (22)) to limits for the other standard deviation. We will later (in Section 5.6) consider the design and analysis of studies intended to at once provide inferences for both σ and σδ.

A second motivation for context 3 is one where there may be only a single operator, but there are n calibration periods involved, period i having its own period-dependent bias, δi. Calibration is never perfect, and one might again suppose that measurement precision is constant and the δi are iid with Var δ = σδ². In this scenario (common, for example, in metrology for nuclear safeguards) σ² and σδ² are again called respectively repeatability and reproducibility variances, but the meaning of the second of these is not variation of operator biases, but rather variation of calibration biases.

As a final topic in this section, consider Bayesian analyses that can produce inferences for a single standard deviation parallel to those represented by display (21). The Bayesian discussion of the previous section, and in particular the implications of the modeling represented by (20), was focused on posterior probability statements for µw. As it turns out, this same modeling and computation produces a simple marginal posterior distribution for σw. That is, following from the joint “density” (20) is a conditional distribution for σw² which is that of (n − 1)sw²/X for X a χ²n−1 random variable. That implies that the limits (21) are both frequentist confidence limits and also Bayesian posterior probability limits for σw. That is, for the improper independent uniform priors on µw and γw = ln(σw), Bayesian inference for σw is operationally equivalent to “ordinary” frequentist inference. Further, WinBUGS Code Set 1 in Section 3.1 serves to generate not only simulated values of µw, but also simulated values of σw from the joint posterior of these variables. (One needs only Code Set 1 to do Bayesian inference in this problem for any function of the pair (µw, σw), including the individual variables.)
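The equivalence can be verified numerically: posterior quantiles of σw computed from the scaled-inverse-χ² form above coincide with the frequentist limits (21). A sketch for the n = 5, sw = √(1/2) example:

```python
import math
from scipy import stats

n, sw = 5, math.sqrt(0.5)

# posterior: sigma_w^2 | data is distributed as (n-1) s_w^2 / X with
# X ~ chi-square(n-1), so the p-th posterior quantile of sigma_w uses
# the (1-p) quantile of X
def post_sigma_quantile(p):
    return math.sqrt((n - 1) * sw**2 / stats.chi2.ppf(1 - p, df=n - 1))

lo, hi = post_sigma_quantile(0.025), post_sigma_quantile(0.975)

# the frequentist chi-square limits (21), for comparison
freq_lo = sw * math.sqrt((n - 1) / stats.chi2.ppf(0.975, df=n - 1))
freq_hi = sw * math.sqrt((n - 1) / stats.chi2.ppf(0.025, df=n - 1))
```

The two pairs agree to machine precision, which is exactly the operational equivalence claimed in the text.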

3.3. Frequentist and Bayesian Inference from a Single Digital Sample

The previous two sections have considered inference associated with a single sample of measurements, assuming that these are conceptually real numbers. We now consider the related problem of inference assuming that measurements have been rounded/quantized/digitized. We will use the modeling ideas of Section 2.2.4 based on a normal model for underlying conceptually real-valued measurements y. That is, we suppose that y1, y2, . . . , yn are modeled as iid N(µ, σ²) (where, depending upon the data collection plan and properties of the measurement method, the model parameters µ and σ can have any of the interpretations considered in the previous two Sections, 3.1 and 3.2). Available for analysis are then integer-valued versions of the yi, the ⌊yi⌋. For f(·|µ, σ²) the univariate normal pdf, the likelihood function (of the two parameters) is

L(µ, σ²) = ∏_{i=1}^{n} ∫_{⌊yi⌋−.5}^{⌊yi⌋+.5} f(t|µ, σ²) dt   (23)

and the corresponding log-likelihood function for µ and γ = ln(σ) is

l(µ, γ) = ln L(µ, exp(2γ))   (24)

and is usually taken as the basis of frequentist inference.

One version of inference for the parameters based on the function (24) is roughly this. Provided that the range of the ⌊yi⌋ is at least 2, the function (24) is concave down, a maximizing pair (µ̂, γ̂) can serve as a joint maximum likelihood estimate of the vector, and the diagonal elements of the negative inverse Hessian for this function (evaluated at the maximizer) can serve as estimated variances for the estimators µ̂ and γ̂, say σ̂µ² and σ̂γ². These in turn lead to approximate confidence limits for µ and γ of the form

µ̂ ± z σ̂µ and γ̂ ± z σ̂γ

(for z a small upper percentage point of the N(0, 1) distribution), and the second of these provides limits for σ² after multiplication by 2 and exponentiation.
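A direct numerical sketch of this maximum likelihood analysis, using scipy's general-purpose optimizer on the hypothetical rounded data 4, 3, 3, 2, 3 (BFGS's built-in inverse-Hessian approximation stands in for the exact Hessian computation described above):

```python
import math
import numpy as np
from scipy import stats, optimize

def neg_log_lik(theta, r):
    """Negative of the log likelihood (24): each rounded value r_i
    contributes the normal probability of the interval (r_i-.5, r_i+.5)."""
    mu, gamma = theta
    sigma = math.exp(gamma)
    p = stats.norm.cdf(r + 0.5, mu, sigma) - stats.norm.cdf(r - 0.5, mu, sigma)
    return -float(np.sum(np.log(np.maximum(p, 1e-300))))

r = np.array([4.0, 3.0, 3.0, 2.0, 3.0])
fit = optimize.minimize(neg_log_lik, x0=[r.mean(), math.log(0.6)], args=(r,))
mu_hat, gamma_hat = fit.x
sigma_hat = math.exp(gamma_hat)

# BFGS's inverse-Hessian approximation gives rough variance estimates,
# hence rough large-sample confidence limits for mu
se_mu, se_gamma = np.sqrt(np.diag(fit.hess_inv))
mu_limits = (mu_hat - 1.96 * se_mu, mu_hat + 1.96 * se_mu)
```

For these data the fitted σ̂ comes out noticeably below the raw sample standard deviation √(1/2), consistent with the de-inflation effect discussed later in this section.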


This kind of analysis can be implemented easily in some standard statistical software suites that do “reliability” or “life data analysis” as follows. If y is normal, then exp(y) is “lognormal.” The lognormal distribution is commonly used in life data analysis, and life data are often recognized to be “interval censored” and are properly analyzed as such. Upon entering the intervals

(exp(⌊yi⌋ − .5), exp(⌊yi⌋ + .5))

as observations and specifying a lognormal model in some life data analysis routines, essentially the above analysis will be done to produce inferences for µ and σ². For example, the user-friendly JMP statistical software (JMP, Version 9, SAS Institute, Cary, N.C., 1989-2011) will do this analysis.

A different frequentist approach to inference for µ and σ² in this model was taken in [11] and [12]. Rather than appeal to large-sample theory for maximum likelihood estimation as above, limits for the parameters based on profile log-likelihood functions were developed, and extensive simulations were used to verify that their intervals have actual coverage properties at least as good as nominal, even in small samples. The intervals for µ from [11] are of the form

{µ | sup_γ l(µ, γ) > l(µ̂, γ̂) − c},   (25)

where if

c = (n/2) ln(1 + t²/(n − 1))

for t the upper α point of the tn−1 distribution, the interval (25) has corresponding actual confidence level at least (1 − 2α) × 100%. The corresponding intervals for σ are harder to describe in precise terms, but are of the same spirit, and the reader is referred to [12] for details.
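The profile likelihood interval (25) can be approximated by a simple grid scan over µ (a rough sketch, not the exact algorithm of [11]; the grid range and the optimizer bounds on γ are ad hoc choices):

```python
import math
import numpy as np
from scipy import stats, optimize

def log_lik(mu, gamma, r):
    """Interval-censored normal log likelihood (23)-(24)."""
    sigma = math.exp(gamma)
    p = stats.norm.cdf(r + 0.5, mu, sigma) - stats.norm.cdf(r - 0.5, mu, sigma)
    return float(np.sum(np.log(np.maximum(p, 1e-300))))

def profile_mu_interval(r, alpha=0.025):
    # interval (25): keep mu whose profile log likelihood is within c of the max
    n = len(r)
    t = stats.t.ppf(1 - alpha, df=n - 1)
    c = (n / 2) * math.log(1 + t * t / (n - 1))

    def profile(mu):
        res = optimize.minimize_scalar(lambda g: -log_lik(mu, g, r),
                                       bounds=(-5.0, 5.0), method="bounded")
        return -res.fun

    mus = np.linspace(r.min() - 2, r.max() + 2, 401)
    vals = np.array([profile(m) for m in mus])
    keep = mus[vals > vals.max() - c]
    return keep.min(), keep.max()

r = np.array([4.0, 3.0, 3.0, 2.0, 3.0])
lo, hi = profile_mu_interval(r)
```

A finer grid (or a root-finder on the profile function) would sharpen the endpoints; the scan above is only meant to make the definition in (25) concrete.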

A Bayesian approach to inference for (µ, σ²) in this digital data context is to replace f(data|µ, exp(2γ)) in (20) with L(µ, exp(2γ)) (for L defined in (23)) and (with improper uniform priors on µ and γ) use L(µ, exp(2γ)) to specify a joint posterior distribution for the mean and log standard deviation. No calculations with this can be done in closed form, but WinBUGS provides for very convenient simulation from this posterior. Appropriate example code is presented next.

WinBUGS Code Set 2

model {
mu~dflat()
logsigma~dflat()
sigma<-exp(logsigma)
tau<-exp(-2*logsigma)
for (i in 1:N) {
L[i]<-R[i]-.5
U[i]<-R[i]+.5
}
for (i in 1:N) {
Y[i]~dnorm(mu,tau) I(L[i],U[i])
}
}
#here are the hypothetical data again
list(N=5,R=c(4,3,3,2,3))
#here is a possible initialization
list(mu=7,logsigma=2)

It is worth running both WinBUGS Code Set 1 (in Section 3.1) and WinBUGS Code Set 2 and comparing the approximate posteriors they produce. The posteriors for the mean are not very different. But there is a noticeable difference in the posteriors for the standard deviations. In a manner consistent with well-established statistical folklore, the posterior for Code Set 2 suggests a smaller standard deviation than does that for Code Set 1. It is widely recognized that ignoring the effects of quantization/digitalization typically inflates one's perception of the standard deviation of an underlying conceptually continuous distribution. This is clearly an important issue in modern digital measurement.
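The folklore is easy to illustrate by simulation: rounding a N(3.0, 0.5²) variable to integers inflates the apparent standard deviation by roughly Sheppard's 1/12 variance correction (a sketch with simulated, not real, data):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
y = rng.normal(3.0, sigma, size=200_000)  # conceptually continuous values
sd_raw = y.std()                          # close to sigma = 0.5
sd_rounded = np.round(y).std()            # close to sqrt(sigma**2 + 1/12)
```

Here the naive sample standard deviation of the rounded values overstates σ by roughly 15%, which is the inflation that the interval-censored likelihood (23) corrects for.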

3.4. Frequentist and Bayesian “Two-Sample” Inference for a Difference in Means

An important elementary statistical method for making comparisons is the frequentist t confidence interval for a difference in means. The Satterthwaite version of this interval, computed from w11, w12, . . . , w1n1 with sample mean w̄1 and sample standard deviation s1, and w21, w22, . . . , w2n2 with sample mean w̄2 and sample standard deviation s2, has endpoints

w̄1 − w̄2 ± t √(s1²/n1 + s2²/n2)   (26)

(for t an upper percentage point from the t distribution with the data-dependent “Satterthwaite approximate degrees of freedom” or, conservatively, degrees of freedom min(n1, n2) − 1). These limits (26) are intended to bracket the difference in the means of the two data-generating mechanisms that produce respectively the w1i and the w2i. The probability model assumptions that support formula (26) are that all of the w's are independent, the w1i iid N(µ1, σ1²) and the w2i iid N(µ2, σ2²), and it is the difference µ1 − µ2 that is being estimated.
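A sketch of the Satterthwaite computation (26), run on the same hypothetical data used later in WinBUGS Code Set 3 so the result can be compared with the simulated posterior of mudiff:

```python
import math
from statistics import mean, stdev
from scipy import stats

def satterthwaite_limits(w1, w2, conf=0.95):
    """Two-sample limits (26) with Satterthwaite approximate df."""
    n1, n2 = len(w1), len(w2)
    v1, v2 = stdev(w1) ** 2 / n1, stdev(w2) ** 2 / n2
    # data-dependent approximate degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    t = stats.t.ppf(1 - (1 - conf) / 2, df=df)
    diff = mean(w1) - mean(w2)
    half = t * math.sqrt(v1 + v2)
    return diff - half, diff + half

lo, hi = satterthwaite_limits([4, 3, 3, 2, 3], [7, 8, 4, 5])
```

For these data the whole interval lies below 0, so the two means are credibly different.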

Depending upon the data collection plan employed and what may be assumed about the measurement method(s), there are a number of possible practical meanings for the difference µ1 − µ2. Among them are


1. a difference in two biases, if the w1i and w2i are measurements y1i and y2i of a single measurand, made repeatedly under fixed conditions using two different methods,

2. a difference in two biases, if the w1i and w2i are single measurements y1i and y2i for n1 + n2 measurands drawn from a stable process, made using two different methods under fixed conditions and the assumption that both methods are linear,

3. a difference in two measurands, if the w1i and w2i are repeat measurements y1i and y2i made on two measurands using a single linear method, and

4. a difference in two mean measurands, if the w1i and w2i are single measurements y1i and y2i made on n1 and n2 measurands from two stable processes made using a single linear method.

Let us elaborate on contexts 1-4 based on the basics of Section 2.2, beginning with context 1.

Where repeat measurements y11, y12, . . . , y1n1 for measurand x are made by Method 1 under fixed conditions and repeat measurements y21, y22, . . . , y2n2 of this same measurand are made under these same fixed conditions by Method 2, the development of Section 2.2.1 may be employed twice. An iid model with marginal mean x + δ1(x) (or x + δ1(x, z0) for fixed z0) and marginal variance σ1²(x) (or σ1²(x, z0) for this z0), independent of an iid model with marginal mean x + δ2(x) (or x + δ2(x, z0) for this z0) and marginal variance σ2²(x) (or σ2²(x, z0) for this z0), is plausible for the two samples of measurements. Normal assumptions or robustness properties of the method (26) then imply that the Satterthwaite t interval limits applied to the n1 + n2 measurements serve to estimate δ1(x) − δ2(x) (or δ1(x, z0) − δ2(x, z0)). Notice that under the assumption that both devices are linear, these limits provide some information regarding how much higher or lower the first method reads than the second for any x.

In context 2, the development of Section 2.2.2 is potentially relevant to the modeling of y11, y12, . . . , y1n1 and y21, y22, . . . , y2n2 that are single measurements on n1 + n2 measurands drawn from a stable process, made using two different methods under fixed conditions. An iid model with marginal mean µx + Eδ1(x) (or µx + Eδ1(x, z0) for fixed z0) and marginal variance σ1² ≡ Eσ1²(x), independent of an iid model with marginal mean µx + Eδ2(x) (or µx + Eδ2(x, z0) for fixed z0) and marginal variance σ2² ≡ Eσ2²(x), might be used if δ1(x) and δ2(x) (or δ1(x, z0) and δ2(x, z0)) are both constant in x, i.e., both methods are linear. Then normal assumptions or robustness considerations imply that method (26) can be used to estimate δ1 − δ2 (where δj ≡ δj(x) or δj(x, z0)).

In context 3, where y11, y12, . . . , y1n1 and y21, y22, . . . , y2n2 are repeat measurements made on two measurands using a single method under fixed conditions, the development of Section 2.2.1 may again be employed. An iid model with marginal mean x1 + δ(x1) (or x1 + δ(x1, z0) for fixed z0) and marginal variance σ²(x1) (or σ²(x1, z0) for this z0), independent of an iid model with marginal mean x2 + δ(x2) (or x2 + δ(x2, z0) for this z0) and marginal variance


σ²(x2) (or σ²(x2, z0) for this z0), is plausible for the two samples of measurements. Normal assumptions or robustness properties of the method (26) then imply that the Satterthwaite t interval limits applied to the n1 + n2 measurements serve to estimate x1 − x2 + (δ(x1) − δ(x2)) (or x1 − x2 + (δ(x1, z0) − δ(x2, z0))). Then, if the device is linear, one has a way to estimate x1 − x2.

Finally, in context 4, where y11, y12, . . . , y1n1 and y21, y22, . . . , y2n2 are single measurements made on n1 and n2 measurands from two stable processes, made using a single linear device, the development of Section 2.2.2 may again be employed. An iid model with marginal mean µ1x + δ and marginal variance σ1² ≡ Eσ1²(x), independent of an iid model with marginal mean µ2x + δ and marginal variance σ2² ≡ Eσ2²(x), might be used. Then normal assumptions or robustness considerations imply that method (26) can be used to estimate µ1x − µ2x.

An alternative to the analysis leading to the Satterthwaite confidence limits (26) is this. To the frequentist model assumptions supporting (26), one adds prior assumptions that take all of µ1, µ2, ln(σ1), and ln(σ2) to be Uniform(−∞,∞) and independent. Provided n1 and n2 are both at least 2, the formal “posterior distribution” of the four model parameters is proper (is a real probability distribution), and posterior intervals for µ1 − µ2 can be obtained by simulating from the posterior. Here is some code for implementing the Bayesian analysis, illustrated for an example calculation with n1 = 5 and n2 = 4.

WinBUGS Code Set 3

#here is the model statement
model {
muw1~dflat()
logsigmaw1~dflat()
sigmaw1<-exp(logsigmaw1)
tauw1<-exp(-2*logsigmaw1)
for (i in 1:N1) {
W1[i]~dnorm(muw1,tauw1)
}
muw2~dflat()
logsigmaw2~dflat()
sigmaw2<-exp(logsigmaw2)
tauw2<-exp(-2*logsigmaw2)
for (j in 1:N2) {
W2[j]~dnorm(muw2,tauw2)
}
mudiff<-muw1-muw2
sigmaratio<-sigmaw1/sigmaw2
}
#here are some hypothetical data
list(N1=5,W1=c(4,3,3,2,3),N2=4,W2=c(7,8,4,5))
#here is a possible initialization
list(muw1=6,logsigmaw1=2,muw2=8,logsigmaw2=3)


WinBUGS Code Set 3 is an obvious modification of WinBUGS Code Set 1 (of Section 3.1). It is worth noting that a similar modification of WinBUGS Code Set 2 (of Section 3.3) provides an operationally straightforward inference method for µ1 − µ2 in the case of two digital samples. We note as well that, in anticipation of the next section, WinBUGS Code Set 3 contains a specification of the ratio σ1/σ2 that enables Bayesian two-sample inference for a ratio of normal standard deviations.

3.5. Frequentist and Bayesian “Two-Sample” Inference for a Ratio of Standard Deviations

Another important elementary statistical method for making comparisons is the frequentist F confidence interval for the ratio of standard deviations for two normal distributions. This interval, computed from w11, w12, . . . , w1n1 with sample standard deviation s1 and w21, w22, . . . , w2n2 with sample standard deviation s2, has endpoints

s1/(s2 √Fn1−1,n2−1,upper) and/or s1/(s2 √Fn1−1,n2−1,lower)   (27)

(for Fn1−1,n2−1,upper and Fn1−1,n2−1,lower, respectively, small upper and lower percentage points of the Fn1−1,n2−1 distribution). These limits (27) are intended to bracket the ratio of standard deviations for two (normal) data-generating mechanisms that produce respectively the w1i and the w2i. The probability model assumptions that support formula (27) are that all of the w's are independent, the w1i iid N(µ1, σ1²) and the w2i iid N(µ2, σ2²), and it is the ratio σ1/σ2 that is being estimated.

Depending upon the data collection plan employed and what may be assumed about the measurement method(s), there are a number of possible practical meanings for the ratio σ1/σ2. Among them are

1. the ratio of two device standard deviations, if the w1i and w2i are measurements y1i and y2i of a single measurand per device made repeatedly under fixed conditions,

2. the ratio of two combinations of device and (the same) measurand standard deviations, if the w1i and w2i are single measurements y1i and y2i for n1 + n2 measurands drawn from a stable process, made using two different methods under fixed conditions and the assumption that both methods are linear and have constant precision, and

3. the ratio of two combinations of (the same) device and measurand standard deviations, if the w1i and w2i are single measurements y1i and y2i made on n1 and n2 measurands from two stable processes made using a single linear method with constant precision.
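Limits (27) can be sketched similarly (the example call uses the sample standard deviations of the two hypothetical samples of Code Set 3, s1² = 1/2 with n1 = 5 and s2² = 10/3 with n2 = 4):

```python
import math
from scipy import stats

def f_ratio_limits(s1, n1, s2, n2, conf=0.95):
    """Limits (27) for the ratio sigma1/sigma2 of two normal sds."""
    alpha = 1 - conf
    f_upper = stats.f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)
    f_lower = stats.f.ppf(alpha / 2, dfn=n1 - 1, dfd=n2 - 1)
    return (s1 / (s2 * math.sqrt(f_upper)),
            s1 / (s2 * math.sqrt(f_lower)))

lo, hi = f_ratio_limits(s1=math.sqrt(0.5), n1=5, s2=math.sqrt(10 / 3), n2=4)
```

With these tiny samples the interval comfortably spans 1, so the data do not distinguish the two standard deviations.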

Let us elaborate slightly on contexts 1-3, beginning with context 1.

Where repeat measurements y11, y12, . . . , y1n1 for measurand x1 are made by Method 1 under fixed conditions and repeat measurements y21, y22, . . . , y2n2 of


a (possibly different) fixed measurand x2 are made under these same fixed conditions by Method 2, the development of Section 2.2.1 may be employed. An iid normal model with marginal variance σ1²(x1) (or σ1²(x1, z0) for this z0), independent of an iid normal model with marginal variance σ2²(x2) (or σ2²(x2, z0) for this z0), is potentially relevant for the two samples of measurements. Then, if measurement precision for each device is constant in x or the two measurands are the same, limits (27) serve to produce a comparison of device precisions.

In context 2, the development in Section 2.2.2, and in particular displays (6) and (9), shows that under normal distribution and independence assumptions, device linearity and constant precision imply that limits (27) serve to produce a comparison of √(σx² + σ1²) and √(σx² + σ2²), which obviously provides only an indirect comparison of measurement precisions σ1 and σ2.

Finally, in context 3, the development in Section 2.2.2, and in particular displays (6) and (9), shows that under normal distribution and independence assumptions, device linearity and constant precision imply that limits (27) serve to produce a comparison of √(σ1x² + σ²) and √(σ2x² + σ²), which again provides only an indirect comparison of process standard deviations σ1x and σ2x.

We then recall that the WinBUGS Code Set 3 used in the previous section to provide an alternative to limits (26) also provides an alternative to limits (27). Further, modification of WinBUGS Code Set 2 for the case of two samples allows comparison of σ1x and σ2x based on digital data.

3.6. Section Summary Comments

It should now be clear to the reader, on the basis of consideration of what are really fairly simple probability modeling notions of Section 2.2 and the most elementary methods of standard statistical inference, that

1. how sources of physical variation interact with a data collection plan determines what can be learned in a statistical study, and in particular how measurement error is reflected in the resulting data,

2. even the most elementary statistical methods have their practical effectiveness limited by measurement variation, and

3. even the most elementary statistical methods are helpful in quantifying the impact of measurement variation.

In addition, it should be clear that Bayesian methodology (particularly as implemented using WinBUGS) is a simple and powerful tool for handling inference problems in models that include components intended to reflect measurement error.

4. Bayesian Computation of Type B Uncertainties

Consider now the program for the Bayesian computation of Type B uncertainties outlined in Section 2.3, based on the model indicated in display (18).


As a very simple example, consider the elementary physics experiment of determining a spring constant. Hooke's law says that over some range of weights (not so large as to permanently deform the spring and not too small), the magnitude of the change in length ∆l of a steel spring when a weight of mass M is hung from it satisfies

k∆l = Mg

for a spring constant, k, peculiar to the spring, so that

k = Mg/∆l.   (28)

One version of a standard freshman physics laboratory exercise is as follows. For several different physical weights, initial and stretched spring lengths are measured, and the implied values of k computed via (28). (These are then somehow combined to produce a single value for k and some kind of uncertainty value.) Unfortunately, there is often only a single determination made with each weight, eliminating the possibility of directly assessing a Type A uncertainty for the lengths, but here we will consider the possibility that several “runs” are made with each physical weight.

In fact, suppose that r different weights are used, weight i with nominal value of Mg equal to φi and standard uncertainty ui = 0.01φi for its value of Mg. Then suppose that weight i is used ni times and length changes

∆lij for j = 1, 2, . . . , ni

(these each coming from the difference of two measured lengths) are produced. A possible model here is that the actual values of Mg for the weights are independent random variables

wi ∼ N(φi, (0.01φi)²)

and the length changes ∆lij are independent random variables

∆lij ∼ N(wi/k, σ²)

(this following from the kind of considerations in Section 2.2.1 if the length measuring device is linear with constant precision and the variability of length change doesn't vary with the magnitude of the change). Then placing independent U(0,∞) and U(−∞,∞) improper prior distributions on k and ln σ respectively, one can use WinBUGS to simulate from the posterior distribution of (k, σ, φ1, φ2, . . . , φr), the k-marginal of which provides both a measurement for k (the mean or median) and a standard uncertainty (the standard deviation).

The WinBUGS code below can be used to implement the above analysis, demonstrated here for a data set where r = 10 and each ni = 4.

WinBUGS Code Set 4

#here is the model statement


model {
k1~dflat()
k<-abs(k1)
logsigma~dflat()
for (i in 1:r) {
tau[i]<-1/(phi[i]*phi[i]*(.0001))
}
for (i in 1:r) {
w[i]~dnorm(phi[i],tau[i])
}
sigma<-exp(logsigma)
taudeltaL<-exp(-2*logsigma)
for (i in 1:r) {
mu[i]<-w[i]/k
}
for (i in 1:r) {
for (j in 1:m) {
deltaL[i,j]~dnorm(mu[i],taudeltaL)
}
}
}
#here are some example data taken from an online
#Baylor University sample physics lab report found 6/3/10 at
#http://www.baylor.edu/content/services/document.php/110769.pdf
#lengths are in cm and weights(forces) are in Newtons
list(r=10,m=4,
phi=c(.0294,.0491,.0981,.196,.392,.589,.785,.981,1.18,1.37),
deltaL=structure(.Data=c(.94,1.05,1.02,1.03,
1.60,1.68,1.67,1.69,
3.13,3.35,3.34,3.35,
6.65,6.66,6.65,6.62,
13.23,13.33,13.25,13.32,
19.90,19.95,19.94,19.95,
26.58,26.58,26.60,26.60,
33.23,33.25,33.23,33.23,
39.92,39.85,39.83,39.83,
46.48,46.50,46.45,46.43), .Dim=c(10,4)))
#here is a possible initialization
list(w=c(.0294,.0491,.0981,.196,.392,.589,.785,.981,1.18,1.37),
k1=2.95,logsigma=-3)

It is potentially of interest that running Code Set 4 produces a posterior mean for k around 2.954 N/m and a corresponding posterior standard deviation for k around 0.010 N/m. These values are in reasonable agreement with a conventional (but somewhat ad hoc) analysis in the original data source. The


present analysis has the virtue of providing a completely coherent integration of the Type B uncertainty information provided for the magnitudes of the weights employed in the lab with the completely empirical measurements of average length changes (possessing their own Type A uncertainty) to produce a rational overall Type B standard uncertainty for k.
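For readers who want a quick sanity check outside of WinBUGS, the sketch below (in Python, which we use for illustration only; it is not part of the original analysis) fits the Hooke's law relation w = k ∆l through the origin by ordinary least squares to the same Baylor data, ignoring the Type B weight uncertainties. It is thus only a rough frequentist cross-check on the posterior summaries just quoted.

```python
# Through-the-origin least squares fit of w = k * (delta l) as a rough
# cross-check on the Bayesian analysis of Code Set 4.
phi = [.0294, .0491, .0981, .196, .392, .589, .785, .981, 1.18, 1.37]  # N
dL_cm = [
    [.94, 1.05, 1.02, 1.03], [1.60, 1.68, 1.67, 1.69],
    [3.13, 3.35, 3.34, 3.35], [6.65, 6.66, 6.65, 6.62],
    [13.23, 13.33, 13.25, 13.32], [19.90, 19.95, 19.94, 19.95],
    [26.58, 26.58, 26.60, 26.60], [33.23, 33.25, 33.23, 33.23],
    [39.92, 39.85, 39.83, 39.83], [46.48, 46.50, 46.45, 46.43],
]

# slope through the origin: k = sum(w*x) / sum(x^2) over all 40 (x, w) pairs,
# with length changes converted from cm to m
num = sum(w * (x / 100.0) for w, row in zip(phi, dL_cm) for x in row)
den = sum((x / 100.0) ** 2 for row in dL_cm for x in row)
k = num / den  # N/m
print(round(k, 3))
```

This prints a value near 2.95 N/m, consistent with the posterior mean above; what it does not deliver is the coherent uncertainty statement that the Bayesian analysis provides.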

The obvious price to be paid before one is able to employ the kind of Bayesian analysis illustrated here in the assessment of Type B uncertainty is the development of non-trivial facility with probability modeling and some familiarity with modern Bayesian computation. The modeling task implicit in display (18) is not trivial. But there is really no simple-minded substitute for the methodical, technically informed clear thinking required to do the modeling, if a correct probability-based assessment of uncertainty is desired.

5. Some Intermediate Statistical Inference Methods and Measurement

5.1. Regression Analysis

Regression analysis is a standard intermediate-level statistical methodology. It is relevant to measurement in several ways. Here we consider its possible uses in the improvement of a measurement system through calibration, and through the correction of measurements for the effects of an identifiable and observed vector of variables z other than the measurand.

5.1.1. Simple Linear Regression and Calibration

Consider the situation where (through the use of standards) one essentially "knows" the values of several measurands and can thus hope to compare measured values from one device to those measurands. (And to begin, we consider the case where there is no identifiable set of extraneous/additional variables believed to affect the distribution of measurement error whose effects one hopes to model.)

A statistical regression model for an observable y and unobserved function µ(x) (potentially applied to measurement y and measurand x) is

y = µ (x) + ε (29)

for a mean 0 error random variable ε, often taken to be normal with standard deviation σ that is independent of x. µ(x) is often assumed to be of a fairly simple form depending upon some parameter vector β. The simplest common version of this model is the "simple linear regression" model where

µ (x) = β0 + β1x . (30)

Under the simple linear regression model for measurement y and measurand x, β1 = 1 means that the device being modeled is linear and δ = β0 is the device


bias. If β1 ≠ 1, then the device is non-linear in metrology terminology (even though the mean measurement is a linear function of the measurand).

Of course, rearranging relationship (30) slightly produces

x = (µ(x) − β0)/β1 .

This suggests that if one knows with certainty the constants β0 and β1, a (linearly calibrated) improvement on the measurement y (one that largely eliminates bias in a measurement y) is

x∗ = (y − β0)/β1 . (31)

This line of reasoning even makes sense where y is not in the same units as x (and thus is not directly an approximation for the measurand). Take for example a case where temperature x is to be measured by evaluating the resistivity y of some material at that temperature. In that situation, a raw y is not a temperature, but rather a resistance. If average resistance is a linear function of temperature, then form (31) represents the proper conversion of resistance to temperature (the constant β1 having the units of resistance divided by those of temperature).

So consider then the analysis of a calibration experiment in which for i = 1, 2, . . . , n the value xi is a known value of a measurand and, corresponding to xi, a fixed measuring device produces a value yi (that might be, but need not be, in the same units as x). Suppose that the simple linear regression model

yi = β0 + β1xi + εi (32)

for the εi iid normal with mean 0 and standard deviation σ is assumed to be appropriate. Then the theory of linear models implies that for b0 and b1 the least squares estimators of β0 and β1, respectively, and

sSLR = √( Σi=1..n (yi − (b0 + b1xi))² / (n − 2) ) ,

confidence limits for β1 are

b1 ± t sSLR / √( Σi=1..n (xi − x̄)² ) (33)

(for t here and throughout this section a small upper percentage point of the tn−2 distribution) and confidence limits for µ(x) = β0 + β1x for any value of x (including x = 0 and thus the case of β0) are

(b0 + b1x) ± t sSLR √( 1/n + (x − x̄)² / Σi=1..n (xi − x̄)² ) . (34)
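The least squares quantities and the limits (33) and (34) are easy to compute directly. As an illustration (in Python rather than WinBUGS), the sketch below uses the n = 6 absorbance calibration data that appear later in this section in Code Set 5; the constant 2.776 is the upper 2.5% point of the t distribution with n − 2 = 4 degrees of freedom, so these are 95% limits.

```python
import math

# Calibration data as used in WinBUGS Code Set 5:
# x = "known" Cr6+ concentration (mg/l), y = measured absorbance
x = [0.0, 1.0, 2.0, 4.0, 6.0, 8.0]
y = [.002, .078, .163, .297, .464, .600]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx                    # least squares slope
b0 = ybar - b1 * xbar             # least squares intercept
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_slr = math.sqrt(sse / (n - 2))  # the estimate sSLR of sigma

t = 2.776  # upper 2.5% point of the t distribution with 4 df

# 95% confidence limits (33) for beta1
slope_limits = (b1 - t * s_slr / math.sqrt(Sxx), b1 + t * s_slr / math.sqrt(Sxx))

def mean_limits(x0):
    """95% confidence limits (34) for mu(x0) = beta0 + beta1*x0."""
    half = t * s_slr * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
    return (b0 + b1 * x0 - half, b0 + b1 * x0 + half)

# x0 = 0 gives limits for beta0 itself
print(b0, b1, s_slr, slope_limits, mean_limits(0.0))
```

For these data the fit gives b1 ≈ 0.075 and b0 ≈ 0.005, with sSLR under 0.01 absorbance units.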


Limits (33) and (34) provide some sense as to how much information the (xi, yi) data from the calibration experiment provide concerning the constants β0 and β1, and indirectly how well the transformation

x∗∗ = (y − b0)/b1

can be expected to do at approximating the transformation (31) giving the calibrated version of y.

What is more directly relevant to the problem of providing properly calibrated measurements with attached appropriate (Type A) uncertainty figures than the confidence limits (33) and (34) are statistical prediction limits for a new or (n + 1)st y corresponding to a measurand x. These are well known to be

(b0 + b1x) ± t sSLR √( 1 + 1/n + (x − x̄)² / Σi=1..n (xi − x̄)² ) . (35)

What is then less well known (but true) is that if one assumes that for an unknown x, an (n + 1)st observation ynew is generated according to the same basic simple linear regression model (32), a confidence set for the (unknown) x is

{x| the set of limits (35) bracket ynew} . (36)

Typically, the confidence set (36) is an interval that contains

x∗∗new = (ynew − b0)/b1

that itself serves as a single (calibrated) version of ynew and thus provides Type A uncertainty for x∗∗new. (The rare cases in which this is not true are those where the calibration experiment leaves large uncertainty about the sign of β1, in which case the confidence set (36) may be of the form (−∞, #) ∪ (##, ∞) for two numbers # < ##.) Typically then, both an approximately calibrated version of a measurement and an associated indication (in terms of confidence limits for x) of uncertainty can be read from a plot of a least squares line and prediction limits (35) for all x, much as suggested in Figure 1.
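The inversion of the prediction limits (35) that produces the confidence set (36) is easy to carry out numerically. The Python sketch below (again using the absorbance calibration data and ynew = 0.2 of Code Set 5 later in this section) simply scans a grid of x values and keeps those for which the limits (35) bracket ynew:

```python
import math

# Calibration data of WinBUGS Code Set 5
x = [0.0, 1.0, 2.0, 4.0, 6.0, 8.0]
y = [.002, .078, .163, .297, .464, .600]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
s_slr = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                      for xi, yi in zip(x, y)) / (n - 2))
t = 2.776  # upper 2.5% point of the t distribution with 4 df

def prediction_limits(x0):
    """95% prediction limits (35) for a new y at measurand x0."""
    half = t * s_slr * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
    return (b0 + b1 * x0 - half, b0 + b1 * x0 + half)

y_new = 0.2
# confidence set (36): grid values of x whose prediction limits bracket y_new
grid = [i * 0.001 for i in range(-2000, 12001)]
conf_set = [x0 for x0 in grid
            if prediction_limits(x0)[0] <= y_new <= prediction_limits(x0)[1]]

x_star = (y_new - b0) / b1  # the calibrated point value x**new
print(x_star, min(conf_set), max(conf_set))
```

Here the set (36) is indeed an interval containing x∗∗new, in agreement with the graphical reading of Figure 1.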

A Bayesian analysis of a calibration experiment under model (32) producing results like the frequentist ones is easy enough to implement, but not without philosophical subtleties that have been a source of controversy over the years. One may treat x1, x2, . . . , xn as known constants, model y1, y2, . . . , yn as independent normal variables with means β0 + β1xi and standard deviation σ, and then suppose that for fixed/known ynew, a corresponding xnew is normal with mean (ynew − β0)/β1 and standard deviation σ/|β1|.³ Then upon placing independent U(−∞,∞) improper prior distributions on β0, β1, and ln σ, one can

³This modeling associated with (xnew, ynew) is perhaps not the most natural that could be proposed. But it turns out that what initially seem like potentially more appropriate assumptions lead to posteriors with some quite unattractive properties.


Fig 1. Plot of Data from a Calibration Experiment, Least Squares Line, and Prediction Limits for ynew.

use WinBUGS to simulate from the joint posterior distribution of the parameters β0, β1, and σ, and the unobserved xnew. The xnew-marginal then provides a calibrated measurement associated with the observation ynew and a corresponding uncertainty (in, respectively, the posterior mean and standard deviation of this marginal).

The WinBUGS code below can be used to implement the above analysis for an n = 6 data set taken from a (no longer available) web page of the School of Chemistry at the University of the Witwatersrand developed by D.G. Billing. Measured absorbance values, y, for solutions with "known" Cr6+ concentrations, x (in mg/l), used in the code are, in fact, the source of Figure 1.

WinBUGS Code Set 5

model {
beta0~dflat()
beta1~dflat()


logsigma~dflat()
for (i in 1:n) {mu[i]<-beta0+(beta1*x[i])}
sigma<-exp(logsigma)
tau<-exp(-2*logsigma)
for (i in 1:n) {y[i]~dnorm(mu[i],tau)}
munew<-(ynew-beta0)/beta1
taunew<-(beta1*beta1)*tau
xnew~dnorm(munew,taunew)
}
#here are the calibration data
list(n=6,ynew=.2,x=c(0,1,2,4,6,8),y=c(.002,.078,.163,.297,.464,.600))

#here is a possible initialization

list(beta0=0,beta1=.1,logsigma=-4,xnew=3)

The reader can verify that the kind of uncertainty in xnew indicated in Figure 1 is completely consistent with what is indicated using the WinBUGS Bayesian technology.

5.1.2. Regression Analysis and Correction of Measurements for the Effects of Extraneous Variables

Another potential use of regression analysis in a measurement context is in the development of a correction of a measurement y for the effects of an identifiable and observed vector of variables z other than the measurand believed to affect the error distribution. If measurements of known measurands, x, can be obtained for a variety of vectors z, regression analysis can potentially be used to find formulas for appropriate corrections.

Suppose that one observes measurements y1, y2, . . . , yn of measurands x1, x2, . . . , xn under observed conditions z1, z2, . . . , zn. Under the assumptions that

1. for any fixed z the measurement device is linear (any measurement bias does not depend upon x, but only potentially upon z), and

2. the precision of the measurement device is a function of neither the measurand nor z (the standard deviation of y for fixed x and z is not a function of these),

it potentially makes sense to consider regression models of the form

(yi − xi) = β0 + Σl=1..p βl zil + εi . (37)


The mean of (yi − xi) in this statistical model, β0 + Σl=1..p βl zil, is δ(zi), the measurement bias. Upon application of some form of fitting for model (37) to produce, say,

δ̂(zi) = b0 + Σl=1..p bl zil , (38)

a possible approximately-bias-corrected/calibrated version of a measurement y made under conditions z becomes

y∗ = y − δ̂(z) .

Ordinary multiple linear regression analysis can be used to fit model (37) to produce the estimated bias (38). Bayesian methodology can also be used. One simply uses independent U(−∞,∞) improper prior distributions for β0, β1, . . . , βp and ln σ, employs WinBUGS to simulate from the joint posterior distribution of the parameters β0, β1, . . . , βp and σ, and uses posterior means for the regression coefficients as point estimates b0, b1, . . . , bp.
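A minimal numerical sketch of the fitting-and-correcting recipe (with entirely hypothetical numbers, and a single extraneous variable z so that p = 1 and ordinary simple linear regression suffices) runs as follows. Suppose checks on items of known measurand x suggest that the bias grows linearly with an ambient temperature z:

```python
# Hypothetical bias-correction example for model (37) with p = 1.
# z = ambient temperature (deg C); err = observed errors y - x for items
# of known measurand measured at those temperatures (made-up values).
z = [10.0, 20.0, 30.0]
err = [0.1, 0.2, 0.3]

n = len(z)
zbar, ebar = sum(z) / n, sum(err) / n
b1 = sum((zi - zbar) * (ei - ebar) for zi, ei in zip(z, err)) / sum(
    (zi - zbar) ** 2 for zi in z)
b0 = ebar - b1 * zbar  # fitted bias per (38) is b0 + b1*z

def corrected(y, z_now):
    """Bias-corrected measurement y* = y - (b0 + b1*z_now), as in the text."""
    return y - (b0 + b1 * z_now)

# A raw reading of 5.25 taken at 25 deg C has its estimated bias of
# 0.25 removed, leaving 5.0
print(corrected(5.25, 25.0))
```

With these constructed numbers the fit is exact (b0 = 0, b1 = 0.01), so the correction is transparent; with real data the fitted coefficients would of course carry their own uncertainty.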

5.2. Shewhart Charts for Monitoring Measurement Device Stability

Shewhart control charts are commonly used for monitoring production processes for the purpose of change detection. (See, for example, their extensive treatment in books such as [19] and [20].) That is, they are commonly used to warn that something unexpected has occurred and that an industrial process is no longer operating in a standard manner. These simple tools are equally useful for monitoring the performance of a measurement device over time. That is, for a fixed measurand x (associated with some physically stable specimen) that can be measured repeatedly with a particular device, suppose that initially the device produces measurements, y, with mean µ(x) and standard deviation σ(x) (that, of course, must usually be estimated through the processing of n measurements y1, y2, . . . , yn into, most commonly, a sample mean ȳ and sample standard deviation s). In what follows, we will take these as determined with enough precision that they are essentially "known."

Then suppose that periodically (say at times i = 1, 2, . . .), one remeasures x and obtains m values y that are processed into a sample mean ȳi and sample standard deviation si. Shewhart control charts are plots of ȳi versus i (the Shewhart "x̄" chart) and si versus i (the Shewhart "s" chart) augmented with "control limits" that separate values of ȳi or si plausible under a "no change in the measuring device" model from ones implausible under such a model. Points plotting inside the control limits are treated as lack of definitive evidence of a change in the measurement device, while ones plotting outside of those limits are treated as indicative of a change in measurement. Of course, the virtue to be had in plotting the points is the possibility it provides of seeing trends potentially providing early warning of (or interpretable patterns in) measurement change.

Standard practice for Shewhart charting of means is to rely upon at least approximate normality of ȳi under the "no change in measurement" model and


set lower and upper control limits at respectively

LCLȳ = µ(x) − 3σ(x)/√m  and  UCLȳ = µ(x) + 3σ(x)/√m . (39)

Something close to standard practice for Shewhart charting of standard deviations is to note that normality for measurements y implies that under the "no change in measurement" model, (m − 1)s²/σ²(x) is χ²m−1, and to thus set lower and upper control limits for si at

√( σ²(x) χ²lower / (m − 1) )  and  √( σ²(x) χ²upper / (m − 1) ) (40)

for χ²lower and χ²upper, respectively, small lower and upper percentage points for the χ²m−1 distribution.

Plotting of ȳi with limits (39) is a way of monitoring basic device calibration. A change detected by this "x̄" chart suggests that the mean measurement, and therefore the measurement bias, has drifted over time. Plotting of si with limits (40) is a way of guarding against unknown change in basic measurement precision. In particular, si plotting above an upper control limit is evidence of degradation in a device's precision.
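The limit computations (39) and (40) are elementary; the Python sketch below works them out for hypothetical values µ(x) = 5.00, σ(x) = 0.02, and m = 5. The constants 0.484 and 11.143 are the 0.025 and 0.975 quantiles of the chi-square distribution with m − 1 = 4 degrees of freedom (so these s-chart limits use 2.5% points rather than any particular conventional choice).

```python
import math

# Hypothetical "no change" characteristics of the measurement device
mu_x, sigma_x, m = 5.00, 0.02, 5

# x-bar chart limits (39)
lcl_ybar = mu_x - 3 * sigma_x / math.sqrt(m)
ucl_ybar = mu_x + 3 * sigma_x / math.sqrt(m)

# s chart limits (40), with 0.025 and 0.975 chi-square quantiles for 4 df
chi2_lower, chi2_upper = 0.484, 11.143
lcl_s = math.sqrt(sigma_x ** 2 * chi2_lower / (m - 1))
ucl_s = math.sqrt(sigma_x ** 2 * chi2_upper / (m - 1))

print(lcl_ybar, ucl_ybar, lcl_s, ucl_s)
```

A periodic sample mean ȳi outside (lcl_ybar, ucl_ybar), or a sample standard deviation si outside (lcl_s, ucl_s), would then signal a possible change in the device.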

There are many other potential applications of Shewhart control charts to metrology. Any statistic that summarizes performance of a process and is newly computed based on current process data at regular intervals might possibly be usefully control charted. So, for example, in situations where a device is regularly recalibrated using the simple linear regression of Section 5.1.1, though one expects the fitted values b0, b1, and sSLR to vary period-to-period, appropriate Shewhart charting of these quantities could be used to alert one to an unexpected change in the pattern of calibrations (potentially interpretable as a fundamental change in the device).

5.3. Use of Two Samples to Separate Measurand Standard Deviation and Measurement Standard Deviation

A common problem in engineering applications of statistics is the estimation of a process standard deviation. As is suggested in Section 2.2.2 and in particular display (7), n measurements y1, y2, . . . , yn of different measurands x1, x2, . . . , xn, themselves drawn at random from a fixed population or process with standard deviation σx, will vary more than the measurands alone because of measurement error. We consider here using two (different kinds of) samples to isolate the process variation, in a context where a linear device has precision that is constant in x and remeasurement of the same specimen is possible.

So suppose that y1, y2, . . . , yn are as just described and that m additional measurements y′1, y′2, . . . , y′m are made for the same (unknown) measurand, x. Under the modeling of Sections 2.2.1 and 2.2.2 and the assumptions of linearity and constant precision, for s²y the sample variance of y1, y2, . . . , yn and s²y′ the


sample variance of y′1, y′2, . . . , y′m, the first of these estimates σ²x + σ² and the second estimates σ², suggesting

σ̂x = √( max(0, s²y − s²y′) )

as an estimate of σx. The so-called "Satterthwaite approximation" (that essentially treats σ̂²x as if it were a multiple of a χ² distributed variable and estimates both the degrees of freedom and the value of the multiplier) then leads to approximate confidence limits for σx of the form

σ̂x √( ν/χ²upper )  and/or  σ̂x √( ν/χ²lower ) (41)

for

ν = σ̂x⁴ / ( s⁴y/(n − 1) + s⁴y′/(m − 1) )

and χ²upper and χ²lower percentage points for the χ²ν distribution. This method is approximate at best. A more defensible way of doing inference for σx is through a simple Bayesian analysis.

That is, it can be appropriate to model y1, y2, . . . , yn as iid normal variables with mean µy and variance σ²x + σ², independent of y′1, y′2, . . . , y′m modeled as iid normal variables with mean µy′ and variance σ². Then using (independent improper) U(−∞,∞) priors for all of µy, µy′, ln σx, and ln σ one might use WinBUGS to find posterior distributions, with focus on the marginal posterior of σx in this case. Some example code for a case with n = 10 and m = 7 is next. (The data are real measurements of the sizes of some binder clips made with a vernier micrometer. Units are mm above 32.00 mm.)

WinBUGS Code Set 6

model {
muy~dflat()
logsigmax~dflat()
sigmax<-exp(logsigmax)
sigmasqx<-exp(2*logsigmax)
muyprime~dflat()
logsigma~dflat()
sigma<-exp(logsigma)
sigmasq<-exp(2*logsigma)
tau<-exp(-2*logsigma)
sigmasqy<-sigmasqx+sigmasq
tauy<-1/sigmasqy
for (i in 1:n) {


y[i]~dnorm(muy,tauy)
}
for (i in 1:m) {yprime[i]~dnorm(muyprime,tau)}
}
#here are some real data

list(n=10,y=c(-.052,-.020,-.082,-.015,-.082,-.050,-.047,-.038,

-.083,-.030),

m=7,yprime=c(-.047,-.049,-.052,-.050,-.052,-.057,-.047))

#here is a possible initialization

list(muy=-.05,logsigmax=-3.6,muyprime=-.05,logsigma=-5.6)

It is perhaps worth checking that, at least for the case illustrated in Code Set 6 (where σ appears to be fairly small in comparison to σx), the Bayesian 95% credible interval for the process/measurand standard deviation σx is in substantial agreement with nominally 95% limits (41).
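Such a check of the frequentist side is straightforward. The Python sketch below computes σ̂x, ν, and the limits (41) from the binder clip data of Code Set 6; to avoid needing a chi-square table at fractional degrees of freedom, it uses the Wilson-Hilferty approximation to chi-square quantiles, with standard normal points ±1.96 giving nominally 95% two-sided limits.

```python
import math
from statistics import variance

# Binder clip data of WinBUGS Code Set 6 (mm above 32.00 mm)
y = [-.052, -.020, -.082, -.015, -.082, -.050, -.047, -.038, -.083, -.030]
yp = [-.047, -.049, -.052, -.050, -.052, -.057, -.047]
n, m = len(y), len(yp)

s2y, s2yp = variance(y), variance(yp)       # sample variances
sigx = math.sqrt(max(0.0, s2y - s2yp))      # point estimate of sigma_x
nu = sigx ** 4 / (s2y ** 2 / (n - 1) + s2yp ** 2 / (m - 1))  # Satterthwaite df

def chi2_quantile(z, df):
    """Wilson-Hilferty approximation to a chi-square quantile, where z is
    the standard normal quantile of the desired probability."""
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3

# limits (41): the lower limit for sigma_x uses the upper chi-square point
lower = sigx * math.sqrt(nu / chi2_quantile(1.96, nu))
upper = sigx * math.sqrt(nu / chi2_quantile(-1.96, nu))
print(sigx, nu, lower, upper)
```

For these data σ̂x is about 0.025 mm with ν near 8.7, and the resulting interval is roughly (0.017, 0.047) mm, which can be compared directly with the WinBUGS credible interval for sigmax.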

5.4. One-Way Random Effects Analyses and Measurement

A standard model of intermediate-level applied statistics is the so-called "one-way random effects model." This, for

wij = the jth observation in an ith sample of ni observations

employs the assumption that

wij = µ+ αi + εij (42)

for µ an unknown parameter, α1, α2, . . . , αI iid normal random variables with mean 0 and standard deviation σα, independent of ε11, ε12, . . . , ε1n1, ε21, . . . , εI−1,nI−1, εI1, . . . , εInI that are iid normal with mean 0 and standard deviation σ. Standard statistical software can be used to do frequentist inference for the parameters of this model (µ, σα, and σ), and Code Set 7 below illustrates the fact that a corresponding Bayesian analysis is easily implemented in WinBUGS. In that code, we have specified independent improper U(−∞,∞) priors for µ and ln σ, but have used a (proper) "inverse-gamma" prior for σ²α. It turns out that in random effects models, U(−∞,∞) priors for log standard deviations (except that of the ε's) fail to produce a legitimate posterior. In the present context, for reasonably large I (say I ≥ 8) an improper U(0,∞) prior can be effectively used for σα. Gelman discusses this issue in detail in [7].

WinBUGS Code Set 7

model {


MU~dflat()

logsigma~dflat()

sigma<-exp(logsigma)
tau<-exp(-2*logsigma)
taualpha~dgamma(.001,.001)
sigmaalpha<-1/sqrt(taualpha)
for (i in 1:I) {mu[i]~dnorm(MU,taualpha)}
for (k in 1:n) {w[k]~dnorm(mu[sample[k]],tau)}
}

#The data here are Argon sensitivities for a mass spectrometer

#produced on 3 different days, taken from a 2006 Quality

#Engineering paper of Vardeman, Wendelberger and Wang

list(n=44,I=3,w=c(31.3,31.0,29.4,29.2,29.0,28.8,28.8,27.7,27.7,

27.8,28.2,28.4,28.7,29.7,30.8,30.1,29.9,32.5,32.2,31.9,30.2,

30.2,29.5,30.8,30.5,28.4,28.5,28.8,28.8,30.6,31.0,31.7,29.8,

29.6,29.0,28.8,29.6,28.9,28.3,28.3,28.3,29.2,29.7,31.1),

sample=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,

2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3))

#here is a possible initialization

list(MU=30,logsigma=0,taualpha=1,mu=c(30,30,30))

There are at least two important engineering measurement contexts where the general statistical model (42) and corresponding data analyses are relevant:

1. a single measurement device is used to produce measurements of multiple measurands drawn from a stable process multiple times apiece, or where

2. one measurand is measured multiple times using each one of a sample of measurement devices drawn from a large family of such devices.

The second of these contexts is common in engineering applications where only one physical device is used, but data groups correspond to multiple operators using that device. Let us elaborate a bit on these two contexts and the corresponding use of the one-way random effects model.

In context 1, take

yij = the jth measurement on item i drawn from a fixed

population of items or process producing items .

If, as in Section 2.2.2,

xi = the measurand for the ith item


and the xi are modeled as iid with mean µx and variance σ²x, linearity and constant precision of the measurement device then make it plausible to model these measurands as independent of iid measurement errors yij − xi that have mean (bias) δ and variance σ². Then with the identifications

µ = µx + δ, αi = xi − µx, and εij = (yij − xi)− δ ,

wij in display (42) is yij and adding normal distribution assumptions produces an instance of the one-way normal random effects model with σα = σx. Then frequentist or Bayesian analyses for the one-way model applied to the yij produce ways (more standard than that developed in Section 5.3) of separating process variation from measurement variation and estimating the contribution of each to overall variability in the measurements.
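The standard frequentist route here is the ANOVA method-of-moments calculation: in a balanced study with m measurements on each of I items, the within-group mean square estimates σ² and (MSB − MSW)/m estimates σ²α = σ²x. A minimal Python sketch, using small hypothetical balanced data (not the argon data of Code Set 7, whose groups are of unequal size):

```python
from statistics import mean, variance

# Hypothetical balanced one-way data: I = 3 items, each measured m = 2 times
w = [[1.0, 3.0], [4.0, 6.0], [7.0, 9.0]]
I, m = len(w), len(w[0])

group_means = [mean(row) for row in w]
grand_mean = mean(group_means)

# between- and within-group mean squares
msb = m * sum((g - grand_mean) ** 2 for g in group_means) / (I - 1)
msw = mean(variance(row) for row in w)

sigma_hat_sq = msw                              # estimates sigma^2
sigma_alpha_hat_sq = max(0.0, (msb - msw) / m)  # estimates sigma_x^2
print(sigma_hat_sq, sigma_alpha_hat_sq)
```

For these numbers MSB = 18 and MSW = 2, so the measurement variance estimate is 2 and the process variance estimate is 8; the truncation at 0 mirrors the max(0, ·) device of Section 5.3.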

In context 2, take

yij = the jth measurement on a single item made

using randomly selected device or method i .

Then as in Section 2.2.3 suppose that (for the single measurand under consideration) bias for method/device i is δi. It can be appropriate to model the δi as iid with mean µδ and variance σ²δ. Then with the identifications

µ = x+ µδ, αi = δi − µδ, and εij = (yij − x)− δi ,

wij in display (42) is yij and adding normal distribution assumptions produces an instance of the one-way normal random effects model with σα = σδ. Then frequentist or Bayesian analyses for the one-way model applied to the yij produce a way of separating σδ from σ and estimating the contribution of each to overall variability in the measurements. We note once more that, in the version of this where what changes device-to-device is only the operator using a piece of equipment, σ is often called a "repeatability" standard deviation and σδ, measuring as it does operator-to-operator variation, is usually called a "reproducibility" standard deviation.

A final observation here comes from the formal similarity of the application of one-way methods to scenarios 1 and 2, and the fact that Section 5.3 provides a simple treatment for scenario 1. Upon proper reinterpretation, it must then also provide a simple treatment for scenario 2, and in particular for separating repeatability and reproducibility variation. That is, where a single item is measured once each by n different operators and then m times by a single operator, the methodologies of Section 5.3 could be used to estimate (not σx but rather) σδ and σ, providing the simplest possible introduction to the engineering topic of "gauge R&R."

5.5. A Generalization of Standard One-Way Random Effects Analyses and Measurement

The model (42), employing as it does a common standard deviation σ across all indices i, could potentially be generalized by assuming that the standard deviation of εij is σi (where σ1, σ2, . . . , σI are potentially different). With effects αi


random as in Section 5.4, one then has a model with parameters µ, σα, σ1, σ2, . . . , and σI. This is not a standard model of intermediate statistics, but handling inference under it in a Bayesian fashion is really not much harder than handling inference under the usual one-way random effects model. Bayesian analysis (using independent improper U(−∞,∞) priors for all of µ, ln σ1, . . . , ln σI and a suitable proper prior for σα per [7]) is easily implemented in WinBUGS by making suitable small modifications of Code Set 7.

A metrology context where this generalized one-way random effects model is useful is that of a round robin study, where the same measurand is measured at several laboratories with the goals of establishing a consensus value for the (unknown) measurand and lab-specific assessments of measurement precision. For

wij = yij = the jth of ni measurements made at lab i ,

one might take µ as the ideal (unknown) consensus value, σα a measure of lab-to-lab variation in measurement, and each σi as a lab i precision. (Note that if the measurand were known, the lab biases µ + αi − x would be most naturally treated as unknown model parameters.)

5.6. Two-Way Random Effects Analyses and Measurement

Several models of intermediate-level applied statistics concern random effects and observations naturally thought of as comprising a two-way arrangement of samples. (One might envision samples laid out in the cells of some two-way row-by-column table.) For

wijk = the kth observation in a sample of nij observations in the

ith row and jth column of a two-way structure ,

we consider random effects models that recognize the two-way structure of the samples. (Calling rows levels of some factor, A, and columns levels of some factor, B, it is common to talk about A and B effects on the cell means.) This can be represented through assumptions that

wijk = µ+ αi + βj + εijk (43)

or
wijk = µ + αi + βj + αβij + εijk (44)

under several possible interpretations of the αi, βj, and potentially the αβij (µ is always treated as an unknown parameter, and the εijk are usually taken to be iid N(0, σ²)). Standard "random effects" assumptions on the terms in (43) or (44) are that the effects of a given type are iid random draws from some mean 0 normal distribution. Standard "fixed effects" assumptions on the terms in (43) or (44) are that effects of a given type are unknown parameters.

A particularly important standard application of two-way random effects analyses in measurement system capability assessment is to gauge R&R studies.


In the most common form of such studies, each of I different items/parts is measured several times by each of J different operators/technicians with a fixed gauge, as a way of assessing the measurement precision obtainable using the gauge. With parts on rows and operators on columns, the resulting data can be thought of as having two-way structure. (It is common practice to make all nij the same, though nothing really requires this or even that all nij > 0, except for the availability of simple formulas and/or software for frequentist analysis.) So we will here consider the implications of forms (43) and (44) in a case where

wijk = yijk = the kth measurement of part i by operator j .

Note that for all versions of the two-way model applied to gauge R&R studies, the standard deviation σ quantifies variability in measurement of a given part by a given operator, the kind of measurement variation called "repeatability" variation in Sections 3.2 and 5.4.

5.6.1. Two-Way Models Without Interactions and Gauge R&R Studies

To begin (either for all αi and βj fixed effects or conditioning on the values of random row and column effects), model (43) says that the mean measurement on item i by operator j is

µ+ αi + βj . (45)

Now, if one assumes that each measurement “device” consisting of operator j using the gauge being studied (in a standard manner, under environmental conditions that are common across all measurements, etc.) is linear, then for some operator-specific bias δj the mean measurement on item i by operator j is

xi + δj (46)

(for xi the ith measurand). Treating the column effects in display (45) as random and averaging produces

an average row i cell mean (under the two-way model with no interactions)

µ+ αi

and treating the operator biases in display (46) as random with mean µδ produces a row i mean operator average measurement

xi + µδ .

Then combining these, it follows that in applying a random-operator-effects two-way model (43) under an operator linearity assumption, one is effectively assuming that

αi = xi + µδ − µ

and thus that differences in αi’s are equally differences in measurands xi (and so, if the row effects are assumed to be random, σα = σx).


In a completely parallel fashion, in applying a random-part-effects two-way model (43) under an operator linearity assumption, one is assuming that

βj = δj + µx − µ

and thus that differences in βj ’s are equally differences in operator-specific biases δj (and so, if column effects are assumed to be random, σβ = σδ, a measure of “reproducibility” variation in the language of Sections 2.2.3 and 5.4).
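The linearity argument above can be checked numerically. In the following sketch (with invented measurands xi and operator biases δj), decomposing the I × J table of cell means xi + δj into grand-mean, row-effect, and column-effect pieces recovers row effects whose differences are exactly the differences in the xi, column effects whose differences are exactly the differences in the δj, and identically zero interactions.

```python
import numpy as np

x = np.array([4.0, 5.5, 3.2, 6.1])   # hypothetical measurands x_i
delta = np.array([-0.2, 0.0, 0.3])   # hypothetical operator biases delta_j

# Cell means under operator linearity: mean measurement on part i by operator j
M = x[:, None] + delta[None, :]

grand = M.mean()
alpha = M.mean(axis=1) - grand       # row (part) effects
beta = M.mean(axis=0) - grand        # column (operator) effects
interaction = M - grand - alpha[:, None] - beta[None, :]

# Differences in alpha_i equal differences in measurands x_i ...
print(np.allclose(np.diff(alpha), np.diff(x)))     # True
# ... differences in beta_j equal differences in biases delta_j ...
print(np.allclose(np.diff(beta), np.diff(delta)))  # True
# ... and the no-interaction form (45) is exact: all interactions vanish
print(np.allclose(interaction, 0.0))               # True
```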

5.6.2. Two-Way Models With Interactions and Gauge R&R Studies

The relationship (46) of Section 5.6.1, appropriate when operator-gauge “devices” are linear, implies that the no-interaction form (45) is adequate to represent any set of measurands and operator biases. So if the more complicated relationship

µ+ αi + βj + αβij (47)

for the mean measurement on item i by operator j (or conditional mean in the event that row, column, or cell effects are random) from model (44) is required to adequately model measurements in a gauge R&R study, the interactions αβij must quantify non-linearity of operator-gauge devices.

Consider then gauge R&R application of the version of the two-way model where all of the row, column, and interaction effects are random. σαβ (describing as it does the distribution of the interactions) is a measure of overall non-linearity, and inference based on the model (44) indicating that this parameter is small would be evidence that the operator-gauge devices are essentially linear.

Lacking linearity, a reasonable question is what should be called “reproducibility” variation in cases where the full complexity of the form (47) is required to adequately model gauge R&R data. An answer to this question can be made by drawing a parallel to the one-way development of Section 5.4. Fixing attention on a single part, one has (conditional on that part) exactly the kind of data structure discussed in Section 5.4. The group-to-group standard deviation (σα in the notation of the previous section) quantifying variation in group means is in the present modeling

√(σβ² + σαβ²),

quantifying the variation in the sums βj + αβij (which are what change cell mean to cell mean in form (47) as one moves across cells in a fixed row of a two-way table). It is thus this parametric function that might be called σreproducibility in an application of the two-way random effects model with interactions to gauge R&R data.

Further, it is sensible to define the parametric function

σR&R = √(σ² + σreproducibility²) = √(σβ² + σαβ² + σ²),


the standard deviation associated by the two-way random effects model with interactions with (for a single fixed part) a single measurement made by a randomly chosen operator. With this notation, the parametric functions σR&R and its components σrepeatability = σ and σreproducibility are of interest to practitioners.
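For balanced data (all nij = m), classical ANOVA point estimators of these parametric functions follow from equating the usual two-way mean squares to their expectations (E MSE = σ², E MSAB = σ² + mσαβ², and, for the operator mean square, E MSB = σ² + mσαβ² + Imσβ²). A minimal sketch, with simulated data, invented variance components, and the usual truncation of negative variance estimates at zero:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, m = 10, 3, 3
mu, s_a, s_b, s_ab, s = 100.0, 2.0, 0.5, 0.3, 0.4  # hypothetical true values

# Simulate y_ijk from the two-way random effects model with interactions
y = (mu + rng.normal(0, s_a, (I, 1, 1)) + rng.normal(0, s_b, (1, J, 1))
     + rng.normal(0, s_ab, (I, J, 1)) + rng.normal(0, s, (I, J, m)))

cell = y.mean(axis=2)                                        # ybar_ij.
row, col, grand = cell.mean(axis=1), cell.mean(axis=0), cell.mean()

MSB = I * m * np.sum((col - grand) ** 2) / (J - 1)           # operator mean square
MSAB = (m * np.sum((cell - row[:, None] - col[None, :] + grand) ** 2)
        / ((I - 1) * (J - 1)))                               # interaction mean square
MSE = np.sum((y - cell[:, :, None]) ** 2) / (I * J * (m - 1))  # error mean square

var_ab = max((MSAB - MSE) / m, 0.0)        # estimate of sigma^2_alphabeta
var_b = max((MSB - MSAB) / (I * m), 0.0)   # estimate of sigma^2_beta
sigma_reproducibility = np.sqrt(var_b + var_ab)
sigma_RR = np.sqrt(var_b + var_ab + MSE)   # estimate of sigma_R&R

print(sigma_reproducibility, sigma_RR)
```

With only J = 3 operators the operator mean square carries just 2 degrees of freedom, so such point estimates of σreproducibility are typically very imprecise; this is one motivation for the interval-based and Bayesian analyses discussed next.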

5.6.3. Analyses of Gauge R&R Data

Two-way random effects analogues of the Bayesian analyses presented before in this exposition are more or less obvious. With independent improper U(−∞,∞) priors for all fixed effects and the log standard deviation of the ε’s, and appropriate proper priors for the other standard deviations, it is possible and effective to use WinBUGS to find posterior distributions for quantities of interest (for example, σR&R from Section 5.6.2). For more details and some corresponding WinBUGS code, see [22].

6. Concluding Remarks

The intention here has been to outline the relevance of statistics to metrology for physical science and engineering in general terms. The particular statistical methods and applications to metrology introduced in Sections 3, 4, and 5 barely scratch the surface of what is possible or needed. This area has huge potential for both stimulating important statistical research and providing real contributions to the conduct of science and engineering. In conclusion, we mention (with even less detail than we have provided in what has gone before) some additional opportunities for statistical collaboration and contribution in this area.

6.1. Other Measurement Types

Our discussion in this article has been carried primarily by reference to univariate and essentially real-valued measurements and simple statistical methods for them. But these are appropriate in only the simplest of the measurement contexts met in modern science and engineering. Sound statistical handling of measurement variability in other data types is also needed, and in many cases new modeling and inference methodology is needed to support this.

In one direction, new work is needed in appropriate statistical treatment of measurement variability in the deceptively simple-appearing case of univariate ordinal “measurements” (and even categorical “measurements”). Reference [6] is an important recent effort in this area and makes some connections to related literature in psychometrics.

In another direction, modern physical measurement devices increasingly produce highly multivariate, essentially real-valued measurements. Sometimes the individual coordinates of these concern fundamentally different physical properties of a specimen, so that their indexing is more or less arbitrary and appropriate statistical methodology needs to be invariant to reordering of the


coordinates. In such cases, classical statistical multivariate analysis is potentially helpful. But the large sample sizes needed to reliably estimate large mean vectors and covariance matrices do not make widespread successful metrology applications of textbook multivariate analysis seem likely.

On the other hand, there are many interesting modern technological multivariate measurement problems where substantial physical structure gives natural subject-matter meaning to coordinate indexing. In these cases, probability modeling assumptions that tie successive coordinates of multivariate data together in appropriate patterns can make realistic sample sizes workable for statistical analysis. One such example is the measurement of weight fractions (of a total specimen weight) of particles of a granular material falling into a set of successive size categories. See [13] for a recent treatment of this problem that addresses the problem of measurement errors. Another class of problems of this type concerns the analysis of measurements that are effectively (discretely sampled versions of) “functions” of some variable t (that could, for example, be time, or wavelength, or force, or temperature, etc.).

Specialized statistical methodology for these applications needs to be developed in close collaboration with technologists and metrologists almost on a case-by-case basis, with every new class of measuring machine and experiment calling for advances in modeling and inference. Problems seemingly as simple as the measurement of 3-d position and orientation (essential to fields from manufacturing to biomechanics to materials science) involve multivariate data and very interesting statistical considerations, especially where measurement error is to be taken into account. For example, measurement of a 3-d location using a coordinate measuring machine has error that is both location-dependent and (user-chosen) probe-path-dependent (matters that require careful characterization and modeling for effective statistical analysis). 3-d orientations are most directly described by orthogonal matrices with positive determinant, and require non-standard probability models and inference methods for the handling of measurement error. (See, for example, reference [2] for an indication of what is needed and can be done in this context.)

Finally, much of modern experimental activity in a number of fields (including drug screening, “combinatorial chemistry,” and materials science) follows a general pattern called “high throughput screening,” in which huge numbers of experimental treatments are evaluated simultaneously so as to “discover” one or a few that merit intensive follow-up. Common elements of these experiments include a very large number of measurements and complicated or (relative to traditional experiments) unusual blocking structures corresponding to, for example, microarray plates and patterns among robotically produced measurements. In this context, serious attention is needed in data modeling and analysis that adequately accounts for relationships between measurement errors and/or other sources of random noise, often regarded as “nuisance effects.”


6.2. Measurement and Statistical Experimental Design and Analysis

Consideration of what data have the most potential for improving understanding of physical systems is a basic activity of statistics. This is the sub-area of statistical design and analysis of experiments. We have said little about metrology and statistically informed planning of data collection in this article. But there are several ways that this statistical expertise can be applied (and further developed) in collaborations with metrologists.

In the first place, in developing a new measurement technology, it is desirable to learn how to make it insensitive to all factors except the value of a measurand. Experimentation is a primary way of learning how that can be done, and statistical experimental design principles and methods can be an important aid in making that effort efficient and effective. Factorial and fractional factorial experimentation and associated data analysis methods can be employed in “ruggedness testing” to see if environmental variables (in addition to a measurand) impact the output of a measurement system. In the event that there are such variables, steps may be taken to mitigate their effects. Beyond the possible simple declaration to users that the offending variables need to be held at some standard values, there is the possibility of re-engineering the measurement method in a way that makes it insensitive to the variables. Sometimes logic and physical principles provide immediate guidance in either insulating the measurement system from changes in the variables or eliminating their impact. Other times experimentation may be needed to find a configuration of measurement system parameters in which the variables are no longer important, and again statistical design and analysis methods become relevant. This latter kind of thinking is addressed in [5].
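As one small illustration (the four environmental factors and the generator here are hypothetical, not drawn from any study discussed above), a two-level half-fraction of the sort used in ruggedness testing can be written down in a few lines: a 2^(4−1) design in factors A, B, C, D built from the full 2³ design via the standard generator D = ABC.

```python
from itertools import product

# Full 2^3 design in factors A, B, C (levels coded -1/+1); the
# generator D = ABC then gives a resolution IV 2^(4-1) half-fraction,
# in which main effects are aliased only with three-factor interactions.
runs = [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

for r in runs:
    print(r)
# 8 runs instead of the 16 of a full 2^4 design
```

Each run specifies one combination of high/low settings of the environmental variables at which the measurement system would be exercised; analysis of the resulting outputs indicates which variables (if any) the system is sensitive to.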

Another way in which statistical planning has the potential to improve measurement is by informing the choice of which measurements are made. This includes but goes beyond ideas of simple sample size choices. For example, statistical modeling and analysis can inform the choice of targets for touching a physical solid with the probe of a coordinate measuring machine if characterization of the shape or position of the object is desired. Statistical modeling and analysis can inform the choice of a set of sieve sizes to be used in characterizing the particle size distribution of a granular solid. And so on.

6.3. The Necessity of Statistical Engagement

Most applied statisticians see themselves as collaborators with scientists and engineers, planning data collection and executing data analysis in order to help learn “what is really going on.” But data don’t just magically appear. They come to scientists and engineers through measurement, and with measurement error. It then only makes sense that statisticians understand and take an interest in helping mitigate the “extra” uncertainty their scientific colleagues face as users of imperfect measurements. We hope that this article proves useful to many in joining these efforts.


References

[1] Automotive Industry Action Group. (2010). Measurement Systems Analysis Reference Manual, 4th ed. Chrysler Group LLC, Ford Motor Company, and General Motors Corporation.

[2] Bingham, M., Vardeman, S., and Nordman, D. (2009). Bayes one-sample and one-way random effects analyses for 3-d orientations with application to materials science. Bayesian Analysis 4, 3, 607–630. DOI:10.1214/09-BA423.

[3] Burr, T., Hamada, M., Cremers, T., Weaver, B., Howell, J., Croft, S., and Vardeman, S. (2011). Measurement error models and variance estimation in the presence of rounding effects. Accreditation and Quality Assurance 16, 7, 347–359. DOI:10.1007/s00769-011-0791-0.

[4] Croarkin, M. C. (2001). Statistics and measurements. Journal of Research of the National Institute of Standards and Technology 106, 1, 279–292.

[5] Dasgupta, T., Miller, A., and Wu, C. (2010). Robust design of measurement systems. Technometrics 52, 1, 80–93.

[6] de Mast, J. and van Wieringen, W. (2010). Modeling and evaluating repeatability and reproducibility of ordinal classifications. Technometrics 52, 1, 94–99.

[7] Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1, 3, 515–533.

[8] Gertsbakh, I. (2002). Measurement Theory for Engineers. Springer-Verlag, Berlin-Heidelberg-New York.

[9] Gleser, L. (1998). Assessing uncertainty in measurement. Statistical Science 13, 3, 277–290.

[10] Joint Committee for Guides in Metrology Working Group 1. (2008). Evaluation of Measurement Data – Guide to the Expression of Uncertainty in Measurement. JCGM 100:2008 (GUM 1995 with minor corrections). International Bureau of Weights and Measures, Sèvres. http://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf.

[11] Lee, C.-S. and Vardeman, S. (2001). Interval estimation of a normal process mean from rounded data. Journal of Quality Technology 33, 3, 335–348.

[12] Lee, C.-S. and Vardeman, S. (2002). Interval estimation of a normal process standard deviation from rounded data. Communications in Statistics 31, 1, 13–34.

[13] Leyva, N., Page, G., Vardeman, S., and Wendelberger, J. (2011). Bayes statistical analyses for particle sieving studies. Technometrics, submitted. LA-UR-10-06017.

[14] Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). WinBUGS – a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing 10, 4, 325–337.

[15] Morris, A. S. (2001). Measurement and Instrumentation Principles, 3rd ed. Butterworth-Heinemann, Oxford.

[16] National Institute of Standards and Technology. (2003). NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook, accessed July 25, 2011.

[17] Taylor, B. and Kuyatt, C. (1994). Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results – NIST Technical Note 1297. National Institute of Standards and Technology, Gaithersburg, MD. http://www.nist.gov/pml/pubs/tn1297/index.cfm.

[18] Vardeman, S. (2005). Sheppard’s correction for variances and the quantization noise model. IEEE Transactions on Instrumentation and Measurement 54, 5, 2117–2119.

[19] Vardeman, S. and Jobe, J. M. (1999). Statistical Quality Assurance Methods for Engineers. John Wiley, New York.

[20] Vardeman, S. and Jobe, J. M. (2001). Basic Engineering Data Collection and Analysis. Duxbury/Thomson Learning, Belmont.

[21] Vardeman, S., Wendelberger, J., Burr, T., Hamada, M., Moore, L., Morris, M., Jobe, J. M., and Wu, H. (2010). Elementary statistical methods and measurement. The American Statistician 64, 1, 52–58.

[22] Weaver, B., Hamada, M., Wilson, A., and Vardeman, S. (2011). A Bayesian approach to the analysis of gauge R&R data. IIE Transactions, submitted. LA-UR-11-00438.
