

An Introduction to Statistical Issues and Methods in Metrology for Physical Science and Engineering∗

Stephen Vardeman1, Michael Hamada2, Tom Burr2, Max Morris1, Joanne Wendelberger2, J. Marcus Jobe3, Leslie Moore2, and Huaiqing Wu1

1Iowa State University, 2Los Alamos National Lab, 3Miami University (Ohio)

August 5, 2011

Abstract

Statistical science and physical metrology are inextricably intertwined. Measurement quality affects what can be learned from data collected and processed using statistical methods, and appropriate data collection and analysis quantifies the quality of physical measurements. Metrologists have long understood this and have often developed their own statistical methodologies and emphases. Some statisticians, notably those working in standards organizations such as the National Institute of Standards and Technology (NIST), at national laboratories, and in quality assurance and statistical organizations in manufacturing concerns have contributed to good measurement practice through the development of statistical methodologies appropriate to the use of ever-more-complicated measurement devices and to the quantification of measurement quality. In this regard, contributions such as the NIST e-Handbook (NIST 2003), the AIAG MSA Manual (Automotive Industry Action Group (2010)), and Gertsbakh (2002) are meant to provide guidance in statistical practice to measurement practitioners, and Croarkin (2001) is a nice survey of metrology work done by statisticians at NIST.

Our purpose here is to provide a unified overview of the interplay between statistics and physical measurement for readers with roughly a first-year graduate-level background in statistics. We hope that most parts of this will also be accessible and useful to many scientists and engineers with somewhat less technical background in statistics, providing an introduction to the best existing technology for the assessment and expression of measurement uncertainty. Our experience is that while important, this material is not commonly known to even quite competent statisticians. So while we don’t claim originality (and surely do not provide a comprehensive summary of all that has been done on the interface between statistics and metrology), we do hope to bring many important issues to the attention of a statistical community broader than that traditionally involved in work with metrologists.

∗Development supported by NSF grant DMS #0502347 EMSW21-RTG awarded to the Department of Statistics, Iowa State University.

We refer to a mix of frequentist and Bayes methodologies in this exposition (the latter implemented in WinBUGS (see Lunn et al (2000))). Our rationale is that while some simple situations in statistical metrology can be handled by well-established and completely standard frequentist methods, many others really call for analyses based on nonstandard statistical models. Development and exposition of frequentist methods for those problems would have to be handled on a very “case-by-case” basis, whereas the general Bayesian paradigm and highly flexible WinBUGS software tool allow de-emphasis of case-by-case technical considerations and concentration on matters of modeling and interpretation.

Contents

1 Introduction
   1.1 Basic Concepts of Measurement/Metrology
   1.2 Probability and Statistics and Measurement
   1.3 Some Complications

2 Probability Modeling and Measurement
   2.1 Use of Probability to Describe Empirical Variation and Uncertainty
   2.2 High-Level Modeling of Measurements
       2.2.1 A Single Measurand
       2.2.2 Measurands From a Stable Process or Fixed "Population"
       2.2.3 Multiple Measurement "Methods"
       2.2.4 Quantization/Digitalization/Rounding and Other Interval Censoring
   2.3 Low-Level Modeling of "Computed" Measurements

3 Simple Statistical Inference and Measurement (Type A Uncertainty Only)
   3.1 Frequentist and Bayesian Inference for a Single Mean
   3.2 Frequentist and Bayesian Inference for a Single Standard Deviation
   3.3 Frequentist and Bayesian Inference From a Single Digital Sample
   3.4 Frequentist and Bayesian "Two-Sample" Inference for a Difference in Means
   3.5 Frequentist and Bayesian "Two-Sample" Inference for a Ratio of Standard Deviations
   3.6 Section Summary Comments

4 Bayesian Computation of Type B Uncertainties

5 Some Intermediate Statistical Inference Methods and Measurement
   5.1 Regression Analysis
       5.1.1 Simple Linear Regression and Calibration
       5.1.2 Regression Analysis and Correction of Measurements for the Effects of Extraneous Variables
   5.2 Shewhart Charts for Monitoring Measurement Device Stability
   5.3 Use of Two Samples to Separate Measurand Standard Deviation and Measurement Standard Deviation
   5.4 One-Way Random Effects Analyses and Measurement
   5.5 A Generalization of Standard One-Way Random Effects Analyses and Measurement
   5.6 Two-Way Random Effects Analyses and Measurement
       5.6.1 Two-Way Models Without Interactions and Gauge R&R Studies
       5.6.2 Two-Way Models With Interactions and Gauge R&R Studies
       5.6.3 Analyses of Gauge R&R Data

6 Concluding Remarks
   6.1 Other Measurement Types
   6.2 Measurement and Statistical Experimental Design and Analysis
   6.3 The Necessity of Statistical Engagement

7 References

1 Introduction

1.1 Basic Concepts of Measurement/Metrology

A fundamental activity in all of science, engineering, and technology is measurement. Before one can learn from empirical observation or use scientific theories and empirical results to engineer and produce useful products, one must be able to measure all sorts of physical quantities. The ability to measure is an obvious prerequisite to data collection in any statistical study. On the other hand, statistical thinking and methods are essential to the rational quantification of the effectiveness of physical measurement.

This article is intended as an MS-level introduction for statisticians to metrology as an important field of application of and interaction with statistical methods. We hope that it will also provide graduate-level scientists and engineers (possessing a substantial background in probability and statistics) access to the best existing technology for the assessment and expression of measurement uncertainty.


We begin with some basic terminology, notation, and concerns of metrology.

Definition 1 A measurand is a physical quantity whose value, x, is of interest and for which some well-defined set of physical steps produce a measurement, y, a number intended to represent the measurand.

A measurand is often a feature or property of some particular physical object, such as "the diameter" of a particular turned steel shaft. But in scientific studies it can also be a more universal and abstract quantity, such as the speed of light in a vacuum or the mass of an electron. In the simplest cases, measurands are univariate, i.e. x ∈ R (and often x > 0), though as measurement technology advances, more and more complicated measurands and corresponding measurements can be contemplated, including vector or even functional x and y (such as mass spectra in chemical analyses). (Notice that per Definition 1, a measurand is a physical property, not a number. So absolutely precise wording would require that we not call x a measurand. But we will allow this slight abuse of language in this exposition, and typically call x a measurand rather than employ the clumsier language "value of a measurand.")

The use of the different notations x and y already suggests the fundamental fact that rarely (if ever) does one get to know a measurand exactly on the basis of real-world measurement. Rather, the measurand is treated as conceptually unknown and unknowable and almost surely not equal to a corresponding measurement. There are various ways one might express the disparity between measurand and measurement. One is in terms of a simple arithmetic difference.

Definition 2 The difference e = y − x will be called a measurement error.

The development of effective measurement methods is the development of ways of somehow assuring that measurement errors will be "small." This, for one thing, involves the invention of increasingly clever ways of using physical principles to produce indicators of measurands. (For example, Morris (2001) is full of discussions of various principles/methods that have been invented for measuring physical properties from voltage to viscosity to pH.) But once a relevant physical principle has been chosen, development of effective measurement also involves somehow identifying (whether through the use of logic alone or additionally through appropriate data collection and analysis) and subsequently mitigating the effects of important physical sources of measurement error. (For example, ISO standards for simple micrometers identify such sources as scale errors, zero-point errors, temperature-induced errors, and anvil parallelism errors as relevant if one wants to produce an effective micrometer of the type routinely used in machine shops.)

In the development of any measurement method, several different ideas of "goodness" of measurement arise. These include the following.

Definition 3 A measurement or measuring method is said to be valid if it usefully or appropriately represents the measurand.


Definition 4 A measurement system is said to be precise if it produces small variation in repeated measurement of the same measurand.

Definition 5 A measurement system is said to be accurate (or sometimes unbiased) if on average (across a large number of measurements) it produces the true or correct value of a measurand.
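As a small numerical illustration of the distinction drawn in Definitions 4 and 5 (using made-up values for the measurand, bias, and spread, not data from any real device), one can simulate repeated measurements and summarize accuracy and precision separately:

```python
import random
import statistics

random.seed(1)

x = 10.0      # the (conceptually unknown) measurand
bias = 0.4    # hypothetical systematic error of the device
sigma = 0.1   # hypothetical standard deviation of repeat measurements

# simulate repeated measurements y = x + bias + random error
ys = [x + bias + random.gauss(0.0, sigma) for _ in range(10000)]

avg_error = statistics.mean(ys) - x   # estimates the bias (an accuracy summary)
spread = statistics.stdev(ys)         # estimates sigma (a precision summary)
```

Here the simulated device is precise (small spread) but inaccurate (average error near 0.4), and averaging many readings cannot remove that systematic error.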

While the colloquial meanings of the words "validity," "precision," and "accuracy" are perhaps not terribly different, it is essential that their technical meanings be kept straight and the words used correctly by statisticians and engineers.

Validity, while a qualitative concept, is the first concern when developing a measurement method. Without it, there is no point in proceeding to consider the quantitative matters of precision or accuracy. The issue is whether a method of measurement will faithfully portray the physical quantity of interest. When developing a new pH meter, one wants a device that will react to changes in acidity, not to changes in temperature of the solution being tested or to changes in the amount of light incident on the container holding the solution.

Precision of measurement has to do with getting similar values every time a measurement of a particular quantity is obtained. It is consistency or repeatability of measurement. A bathroom scale that can produce any number between 150 lb and 160 lb when one gets on it repeatedly is really not very useful. After establishing that a measurement system produces valid measurements, consistency of those measurements is needed. Precision is largely an intrinsic property of a measurement method or system. After all possible steps have been taken to mitigate important physical sources of measurement variation, there is not really any way to "adjust" for poor precision or to remedy it except to 1) overhaul or replace measurement technology or to 2) average multiple measurements. (Some implications of this second possibility will be seen later, when we consider simple statistical inference for means.)

But validity and precision together don’t tell the whole story regarding the usefulness of real-world measurements. The issue of accuracy remains. Does the measurement system or method produce the "right" value on average? In order to assess this, one needs to reference the system to an accepted standard of measurement. The task of comparing a measurement method or system to a standard one and, if necessary, working out conversions that will allow the method to produce "correct" (converted) values on average is called calibration. In the United States, the National Institute of Standards and Technology (NIST) is responsible for maintaining and disseminating consistent standards (items whose measurands are treated as "known" from measurement via the best available methods) for calibrating measurement equipment.

An analogy that is sometimes helpful in remembering the difference between accuracy and precision of measurement is that of target shooting. Accuracy in target shooting has to do with producing a pattern centered on the bull’s eye (the ideal). Precision has to do with producing a tight pattern (consistency).

In Definitions 4 and 5, the notions of "repeated measurement" and "average" are a bit nebulous. They refer somewhat vaguely to the spectrum of measurement values that one might obtain when using the method under discussion in pursuit of a single measurand. This must depend upon a careful operational definition of the "method" of measurement. The more loosely this is defined (for example by allowing for different "operators" or days of measurement, or batches of chemical reagent used in analysis, etc.) the larger the conceptual spectrum of outcomes that must be considered (and, for example, correspondingly the less precise in the sense of Definition 4 will be the method).

Related to Definition 5 is the following.

Definition 6 The bias of a measuring method for evaluating a measurand is the difference between the measurement produced on average and the value of the measurand.

The ideal is, of course, that measurement bias is negligible. If a particular measurement method has a known and consistent bias across some set of measurands of interest, how to adjust a measurement for that bias is obvious. One may simply subtract the bias from an observed measurement (thereby producing a higher-level "method" with no (or little) bias). The possibility of such constant bias has its own terminology.

Definition 7 A measurement method or system is called linear if it has no bias or if bias is constant in the measurand.

Notice that this language is a bit more specialized than the ordinary mathematical meaning of the word "linear." What is required by Definition 7 is not simply that average y be a linear function of x, but that the slope of that linear relationship be 1. (Then, under Definition 7, the y-intercept is the bias of measurement.)

Because a measurement method or system must typically be used across time, it is important that its behavior does not change over time. When that is true, we might then employ the following language.

Definition 8 A measurement method or system is called stable if both its precision and its bias for any measurand are constant across time.

Some measurements are produced in a single fairly simple/obvious/direct step. Others necessarily involve computation from several more basic quantities. (For example, the density of a liquid can only be determined by measuring a weight/mass of a measured volume of the liquid.) In this latter case, it is common for some of the quantities involved in the calculation to have "uncertainties" attached to them whose bases may not be directly statistical, or if they are based on data, derive from data not available when a measurement is being produced. It is then not absolutely obvious (and a topic of later discussion in this exposition) how one might combine information from data in hand with such uncertainties to arrive at an appropriate quantification of uncertainty of measurement. It is useful in the discussion of these issues to have terminology for the nature of uncertainties associated with basic quantities used to compute a measurement.


Definition 9 A numerical uncertainty associated with an input to the computation of a measurement is of Type A if it is statistical/derived entirely from calculation based on observations (data) available to the person producing the measurement. If an uncertainty is not of Type A, it is of Type B.

Taylor and Kuyatt (1994) state that "Type B evaluation of standard uncertainty is usually based on scientific judgment using all the relevant information available, which may include

• previous measurement data,

• experience with, or general knowledge of, the behavior and property of relevant materials and instruments,

• manufacturer’s specifications,

• data provided in calibration and other reports, and

• uncertainties assigned to reference data taken from handbooks."

1.2 Probability and Statistics and Measurement

The subjects of probability and statistics have intimate connections to the theory and practice of measurement. For completeness and the potential aid of readers who are not statisticians, let us say clearly what we mean by reference to these subjects.

Definition 10 Probability is the mathematical theory intended to describe chance/randomness/random variation.

The theory of probability provides a language and set of concepts and results directly relevant to describing the variation and less-than-perfect predictability of real-world measurements.

Definition 11 Statistics is the study of how best to

1. collect data,

2. summarize or describe data, and

3. draw conclusions or inferences based on data,

all in a framework that recognizes the reality and omnipresence of variation in physical processes.

How sources of physical variation interact with a (statistical) data collection plan governs how measurement error is reflected in the resulting data set (and ultimately what of practical importance can be learned). On the other hand, statistical efforts are an essential part of understanding, quantifying, and improving the quality of measurement. Appropriate data collection and analysis provides ways of identifying (and ultimately reducing the impact of) sources of measurement error.

The subjects of probability and statistics together provide a framework for clear thinking about how sources of measurement variation and data collection structures combine to produce observable variation, and conversely how observable variation can be decomposed to quantify the importance of various sources of measurement error.

1.3 Some Complications

The simplest discussions of modeling, quantifying, and to the extent possible mitigating the effects of measurement error do not take account of some very real and important complications that arise in many scientific and engineering measurement contexts. We will try to consider how some of these can be handled as this exposition proceeds. For the time being, we primarily simply note their potential presence and the kind of problems they present.

Some measurement methods are inherently destructive and cannot be used repeatedly for a given measurand. This makes evaluation of the quality of a measurement system problematic at best, and only fairly indirect methods of assessing precision, bias, and measurement uncertainty are available.

It is common for sensors used over time to produce measurements that exhibit "carry-over" effects from "the last" measurement produced by the sensor. Slightly differently put, it is common for measured changes in a physical state to lag actual changes in that state. This effect is known as hysteresis. For example, temperature sensors often read "high" in dynamic conditions of decreasing temperature and read "low" in dynamic conditions of increasing temperature, thereby exhibiting hysteresis.

In developing a measurement method, one wants to eliminate any important systematic effects on measurements of recognizable variables other than the measurand. Where some initial version of a measurement method does produce measurements depending in a predictable way upon a variable besides the measurand and that variable can itself be measured, the possibility exists of computing and applying an appropriate correction for the effects of that variable. For example, performance of a temperature sensor subject to hysteresis effects might be improved by adjusting raw read temperatures in light of temperatures read at immediately preceding time periods.

The simplest measurement contexts (in both conceptual and operational terms) are those where a method has precision that does not depend upon the measurand. "Constant variance" statistical models and methods are simpler and more widely studied than those that allow for non-constant variance. But where precision of measurement is measurand-dependent, it is essential to recognize that fact in modeling and statistical analysis. One of the great virtues of the Bayes statistical paradigm that we will employ in our discussion is its ability to incorporate features such as non-constant variance into an analysis in more or less routine fashion.
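As an illustration of the sort of lag correction mentioned for temperature sensors above, suppose (purely hypothetically) that readings follow a first-order carry-over model y_t = (1 − λ)x_t + λx_{t−1} with a mixing constant λ known from calibration. A recursive de-lagging correction might then be sketched as:

```python
def delag(readings, lam, x0):
    """Invert the assumed carry-over model y[t] = (1 - lam)*x[t] + lam*x[t-1]
    recursively, given the initial true value x0."""
    xs = [x0]
    for y in readings[1:]:
        xs.append((y - lam * xs[-1]) / (1.0 - lam))
    return xs

# true temperatures and the lagged readings the assumed model would produce
truth = [20.0, 22.0, 25.0, 24.0, 21.0]
lam = 0.3
readings = [truth[0]] + [
    (1 - lam) * truth[t] + lam * truth[t - 1] for t in range(1, len(truth))
]

corrected = delag(readings, lam, truth[0])  # recovers truth exactly (no noise here)
```

In practice λ, the initial state, and measurement noise would all need to be estimated, so this is only the skeleton of such a correction, not a recommended procedure.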


Real-world measurements are always expressed only "to some number of digits." This is especially obvious in a world where gauges are increasingly equipped with digital displays. But then, if one conceives of measurands as real (infinite number of decimal places) numbers, this means that there is so-called quantization error or digitalization error to be accounted for when modeling and using measurements. When one reports a measurement as 4.1 mm, what is typically meant is "between 4.05 mm and 4.15 mm." So, strictly speaking, 4.1 is not a real number, and the practical meaning of ordinary arithmetic operations applied to such values is not completely clear. Sometimes this issue can be safely ignored, but in other circumstances it cannot. Ordinary statistical methods of the sort presented in essentially every elementary introduction to the subject are built on an implicit assumption that the numbers used are real numbers. A careful handling of the quantization issue therefore requires rethinking of what are appropriate statistical methods, and we will find the notion of interval censoring/rounding to be helpful in this regard. (We issue a warning at this early juncture that despite the existence of a large literature on the subject, much of it in electrical engineering venues, we consider the standard treatment of quantization error as independent of the measurement and uniformly distributed to be thoroughly unsatisfactory.)

Quantization might be viewed as one form of "coarsening" of observations in a way that potentially causes some loss of information available from measurement. (This is at least true if one adopts the conceptualization that corresponding to a real-valued measurand there is a conceptual perfect real-valued measurement that is converted to a digital response in the process of reporting.) There are other similar but more extreme possibilities in this direction that can arise in particular measurement contexts. There is often the possibility of encountering a measurand that is "off the scale" of a measuring method. It can be appropriate in such contexts to treat the corresponding observation as "left-censored" at the lower end of the measurement scale or "right-censored" at the upper end. A potentially more problematic circumstance arises in some chemical analyses, where an analyst may record a measured concentration only as below some "limit of detection."1 This phrase often has a meaning less concrete and more subjective than simply "off scale," and reported limits of detection can vary substantially over time and are often not accompanied by good documentation of laboratory circumstances. But unless there is an understanding of the process by which a measurement comes to be reported as below a particular numerical value that is adequate to support the development of a probability model for the case, little can be done in the way of formal use of such observations in statistical analyses.

1 This use of the terminology "limit of detection" is distinct from a second one common in analytical chemistry contexts. If a critical limit is set so that a "blank" sample will rarely be measured above this limit, the phrase "lower limit of detection" is sometimes used to mean the smallest measurand that will typically produce a measurement above the critical limit. See pages 28-34 of Vardeman and Jobe (1999) in this regard.

Related to the notion of coarsening of observations is the possibility that final measurements are produced only on some ordinal scale. At an extreme, a test method might produce a pass/fail (0/1) measurement. (It is possible, but not necessary, that such a result comes from a check as to whether some real-number measurement is in an interval of interest.) Whether ordinal measurements are coarsened versions of conceptual real-number measurements or are somehow arrived at without reference to any underlying interval scale, the meaning of arithmetical operations applied to them is unclear; simple concepts such as bias as in Definition 6 are not directly relevant. The modeling of measurement error in this context requires something other than the most elementary probability models, and realistic statistical treatments of such measurements must be made in light of these less common models.
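Returning to quantization: the interval-censoring view suggests likelihoods built from interval probabilities rather than density ordinates, so that a value reported as 4.1 mm contributes P[4.05 < y < 4.15] instead of a density value at 4.1. A minimal sketch (with hypothetical rounded data and a normal model assumed for the pre-rounding measurements) is:

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rounded_loglik(mu, sigma, reported, half_width=0.05):
    """Log likelihood treating each reported value as interval-censored:
    a report r means the underlying normal(mu, sigma^2) measurement fell
    in (r - half_width, r + half_width)."""
    total = 0.0
    for r in reported:
        p = (norm_cdf((r + half_width - mu) / sigma)
             - norm_cdf((r - half_width - mu) / sigma))
        total += math.log(p)
    return total

data = [4.1, 4.1, 4.2, 4.0, 4.1]  # hypothetical measurements rounded to 0.1 mm
ll_near = rounded_loglik(4.1, 0.05, data)
ll_far = rounded_loglik(4.3, 0.05, data)  # a badly placed mean is far less plausible
```

Such interval-probability terms replace the usual density contributions in both frequentist likelihood and Bayesian analyses of rounded data.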

2 Probability Modeling and Measurement

2.1 Use of Probability to Describe Empirical Variation and Uncertainty

We will want to use the machinery of probability theory to describe various kinds of variation and uncertainty in both measurement and the collection of statistical data. Most often, we will build upon continuous univariate and joint distributions. This is realistic only for real-valued measurands and measurements and reflects the extreme convenience and mathematical tractability of such models. We will use notation more or less standard in graduate-level statistics courses, except possibly for the fact that we will typically not employ capital letters for random variables. The reader can determine from context whether a lower-case letter is standing for a particular random variable or for a realized/possible value of that random variable.

Before going further, we need to make a philosophical digression and say that at least two fundamentally different kinds of things might get modeled with the same mathematical formalisms. That is, a probability density f (y) for a measurement y can be thought of as modeling observable empirical variation in measurement. This is a fairly concrete kind of modeling. A step removed from this might be a probability density f (x) for unobservable, but nevertheless empirical variation in a measurand x (perhaps across time or across different items upon which measurements are taken). Hence inference about f (x) must be slightly indirect and generally more difficult than inference about f (y) but is important in many engineering contexts. (We are here abusing notation in a completely standard way and using the same name, f, for different probability density functions (pdf’s).)

A different kind of application of probability is to the description of uncertainty. That is, suppose φ represents some set of basic variables that are involved in the calculation of a measurement y and one has no direct observation(s) on φ, but rather only some single value for it and uncertainty statements (of potentially rather unspecified origin) for its components. One might want to characterize φ with some (joint) probability density. In doing so, one is not really thinking of that distribution as representing potential empirical variation in φ, but rather the state of one’s ignorance (or knowledge) of it. With sufficient characterization, this uncertainty can be "propagated" through the measurement calculation to yield a resulting measure of uncertainty for y.
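Such propagation can be carried out by simple Monte Carlo: assign each component of φ an uncertainty distribution, push draws through the measurement calculation, and summarize the induced spread in y. A sketch for a density computed as mass/volume, with invented values and normal uncertainty distributions assumed for the inputs:

```python
import random
import statistics

random.seed(2)

# hypothetical reported values and standard uncertainties for the inputs
m_val, m_u = 35.0, 0.2   # mass in g
v_val, v_u = 14.0, 0.1   # volume in cm^3

# draw the inputs from normal "uncertainty" distributions and propagate
# each draw through the measurement calculation rho = m / v
draws = []
for _ in range(100000):
    m = random.gauss(m_val, m_u)
    v = random.gauss(v_val, v_u)
    draws.append(m / v)

rho = statistics.mean(draws)     # close to 35/14 = 2.5 g/cm^3
rho_u = statistics.stdev(draws)  # Monte Carlo standard uncertainty for rho
```

For this smooth function, the Monte Carlo standard uncertainty agrees closely with first-order ("delta-method") propagation, u(ρ)² ≈ (u_m/V)² + (m u_V/V²)², which is about 0.023 g/cm³ here.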

While few would question the appropriateness of using probability to model either empirical variation or (rather loosely defined) "uncertainty" about some inexactly known quantity, there could be legitimate objection to combining the two kinds of meaning in a single model. In the statistical world, controversy about simultaneously modeling empirical variation and "subjective" knowledge about model parameters was historically the basis of the "Bayesian-frequentist debate." As time has passed and statistical models have increased in complexity, this debate has faded in intensity and, at least in practical terms, has been largely carried by the Bayesian side (that takes as legitimate the combining of different types of modeling in a single mathematical structure). We will see that in some cases, "subjective" distributions employed in Bayesian analyses are chosen to be relatively "uninformative" and ultimately produce inferences little different from frequentist ones.

In metrology, the question of whether Type A and Type B uncertainties should be treated simultaneously (effectively using a single probability model) has been answered in the affirmative by the essentially universally adopted Guide to the Expression of Uncertainty in Measurement originally produced by the International Organization for Standardization (Joint Committee for Guides in Metrology Working Group 1 (2008)). As Gleser (1998) has said, the recommendations made in the so-called "GUM" (guide to uncertainty in measurement) "can be regarded as approximate solutions to certain frequentist and Bayesian inference problems."

In this exposition we will from time to time and without further discussion use single probability models, part of which describe empirical variation and part of which describe uncertainty. We will further (as helpful) employ both frequentist and Bayesian statistical analyses, the latter especially because of their extreme flexibility and ability to routinely handle inferences for which no other methods have been developed in the existing statistical literature.

of models for Type B uncertainty. It seems to be fairly standard in metrologythat where a univariate variable φ is announced as having value φ∗ and some"standard uncertainty" u, probability models entertained for φ are

• normal with mean φ∗ and standard deviation u, or

• uniform on(φ∗ −

√3u, φ∗ +

√3u)(and therefore with mean φ∗ and stan-

dard deviation u), or

• symmetrically triangular on(φ∗ −

√6u, φ∗ +

√6u)(and therefore again

with mean φ∗ and standard deviation u).
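For readers who want to experiment, these three standard choices can be sampled in a few lines of Python; the function name and the specific values used below are ours (purely illustrative), not part of any standard.

```python
import math
import random

def type_b_draw(phi_star, u, dist, rng):
    """One draw for phi under a standard Type B model; all three choices
    have mean phi_star and standard deviation u."""
    if dist == "normal":
        return rng.gauss(phi_star, u)
    if dist == "uniform":
        half = math.sqrt(3.0) * u       # half-width giving sd exactly u
        return rng.uniform(phi_star - half, phi_star + half)
    if dist == "triangular":
        half = math.sqrt(6.0) * u       # half-width giving sd exactly u
        return rng.triangular(phi_star - half, phi_star + half, phi_star)
    raise ValueError("unknown distribution: " + dist)
```

Simulating many draws from any of the three choices reproduces mean φ∗ and standard deviation u, which is an easy way to verify the √3 u and √6 u half-widths.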


2.2 High-Level Modeling of Measurements

2.2.1 A Single Measurand

Here we introduce probability notation that we will employ for describing at a high level the measurement of a single, isolated, conceptually real-number measurand x to produce a conceptually real-number measurement y and measurement error e = y − x. In the event that the distribution of measurement error is not dependent upon identifiable variables other than the measurand itself, it is natural to model y with some (conditional on x) probability density

f (y|x) (1)

which then leads to a (conditional) mean and variance for y

E [y|x] = µ (x) and Var [y|x] = Var [e|x] = σ2 (x) .

Ideally, a measurement method is unbiased (i.e. perfectly accurate) and

µ (x) = x, i.e., E [e|x] = 0 ,

but when that cannot be assumed, one might well call

δ (x) = µ (x)− x = E [e|x]

the measurement bias. Measurement method linearity requires that δ (x) is constant, i.e., δ (x) ≡ δ. Where measurement precision is constant, one can suppress the dependence of σ2 (x) upon x and write simply σ2.

Somewhat more complicated notation is appropriate when the distribution of measurement error is known to depend on some identifiable and observed vector of variables z. It is then natural to model y with some (conditional on both x and z) probability density

f (y|x, z) (2)

(we continue to abuse notation and use the same name, f, for a variety of different functions) which in this case leads to a (conditional) mean and variance

E [y|x, z] = µ (x, z) and Var [y|x, z] = Var [e|x, z] = σ2 (x, z) .

Here the measurement bias is

δ (x, z) = µ (x, z)− x = E [e|x, z]

and potentially depends upon both x and z. This latter possibility is clearly problematic unless one can either

1. hold z constant at some value z0 and reduce bias to (at worst) a function of x alone, or


2. so fully model the dependence of the mean measurement on both x and z that one can, for a particular z = z∗ observed during the measurement process, apply the corresponding function of x, δ (x, z∗), in the interpretation of a measurement.

Of course, precision that depends upon z is also unpleasant and would call for careful modeling and statistical analysis, and (especially if z itself is measured and thus not perfectly known) the kind of correction suggested in 2 will in practice not be exact.

Should one make multiple measurements of a single measurand, say y1, y2, . . . , yn, the simplest possible modeling of these is as independent random variables with joint pdf either

f (y|x) = ∏_{i=1}^{n} f (yi|x)

in the case the measurement error distribution does not depend upon identifiable variables besides the measurand and form (1) is used, or

f (y|x, (z1, z2, . . . , zn)) = ∏_{i=1}^{n} f (yi|x, zi)

in the case of form (2) where there is dependence of the form of the measurement error distribution upon z.
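For concreteness, the first of these joint pdfs is easy to evaluate in code. Here is a minimal Python sketch under a particular normal measurement-error model (the function name and the assumption that yi = x + δ + ei with ei iid N(0, σ2) are ours, for illustration only):

```python
import math

def loglik_repeat(ys, x, delta, sigma):
    """Log of prod_i f(y_i | x) when y_i = x + delta + e_i with
    e_i iid N(0, sigma^2); names and parameterization are illustrative."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - (y - x - delta) ** 2 / (2.0 * sigma ** 2) for y in ys)
```

Evaluating such a log joint density as a function of unknown parameters is exactly the likelihood computation that underlies the inference methods of Section 3.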

2.2.2 Measurands From a Stable Process or Fixed "Population"

In engineering contexts, it is fairly common to measure multiple items all coming from what one hopes is a physically stable process. In such a situation, there are multiple measurands and one might well conceive of them as generated in an iid (independent and identically distributed) fashion from some fixed distribution. For any one of the measurands, x, it is perhaps plausible to suppose that x has pdf (describing empirical variation)

f (x) (3)

and based on calculation using this distribution we will adopt the notation

Ex = µx and Varx = σ2x . (4)

In such engineering contexts, these process parameters (4) are often of as much interest as the individual measurands they produce.

Density (3) together with conditional density (1) produce a joint pdf for an (x, y) pair and marginal moments for the measurement

Ey = EE [y|x] = E (x+ δ (x)) = µx + Eδ (x) (5)

and

Vary = VarE [y|x] + EVar [y|x] = Varµ (x) + Eσ2 (x) . (6)


Now relationships (5) and (6) are, in general, not terribly inviting. Note however that in the case the measurement method is linear (δ (x) ≡ δ and µ (x) = E[y|x] = x + δ) and measurement precision is constant in x, displays (5) and (6) reduce to

Ey = µx + δ and Vary = σ2x + σ2 . (7)

Further observe that the marginal pdf for y following from densities (3) and (1) is

f (y) = ∫ f (y|x) f (x) dx . (8)
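As a quick (and entirely illustrative) numerical check of display (7), one can simulate the two-stage model with a hypothetical linear, constant-precision device; all parameter values below are invented:

```python
import random
import statistics

# hypothetical process and (linear, constant-precision) device parameters
MU_X, SIGMA_X = 50.0, 2.0   # measurand distribution mean and sd
DELTA, SIGMA = 0.7, 0.5     # fixed bias and measurement sd

def simulate_measurements(n, rng):
    """Draw x from (3), then y = x + delta + e per the model of Section 2.2.1."""
    return [rng.gauss(MU_X, SIGMA_X) + DELTA + rng.gauss(0.0, SIGMA)
            for _ in range(n)]

ys = simulate_measurements(200_000, random.Random(1))
```

With these values, (7) gives Ey = 50.7 and Vary = 4.25, and the simulated sample moments of `ys` land close to those targets.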

The variant of this development appropriate when the distribution of measurement error is known to depend on some identifiable and observed vector of variables z has

E [y|z] = E [E [y|x, z] |z] = E [x+ δ (x, z) |z] = µx + E [δ (x, z) |z]

and

Var [y|z] = Var [E [y|x, z] |z] + E [Var [y|x, z] |z]
= Var [µ (x, z) |z] + E [σ2 (x, z) |z] . (9)

Various simplifying assumptions about bias and precision can lead to simpler versions of these expressions. The pdf of y|z following from (3) and (2) is

f (y|z) = ∫ f (y|x, z) f (x) dx . (10)

Where single measurements are made on iid measurands x1, x2, . . . , xn and the form of the measurement error distribution does not depend on any identifiable additional variables beyond the measurand, a joint (unconditional) density for the corresponding measurements y1, y2, . . . , yn is

f (y) = ∏_{i=1}^{n} f (yi) ,

for f (y) of the form (8). A corresponding joint density based on the form (10) for cases where the distribution of measurement error is known to depend on identifiable and observed vectors of variables z1, z2, . . . , zn is

f (y| (z1, z2, . . . , zn)) = ∏_{i=1}^{n} f (yi|zi) .

2.2.3 Multiple Measurement "Methods"

The notation z introduced in Section 2.2.1 can be used to describe several different kinds of "non-measurand effects" on measurement error distributions. In some contexts, z might stand for some properties of a measured item that do not directly affect the measurand associated with it, but do affect measurement. In others, z might stand for ambient or environmental conditions present when a measurement is made that are not related to the measurand. In other contexts, z might stand for some details of measurement protocol or equipment or personnel that again do not directly affect what is being measured, but do impact the measurement error distribution. In this last context, different values of z might be thought of as effectively producing different measurement devices or "methods," and in fact assessing the importance of z to measurement (and mitigating any large, potentially worrisome effects so as to make measurement more "consistent") is an important activity.

One instance of this last possibility is that where z is a qualitative variable, taking one of, say, J different values j = 1, 2, . . . , J in a measurement study. In an engineering production context, each possible value of z might identify a different human operator that will use a particular gauge and protocol to measure some geometric characteristic of a metal part. In this context, variation in δ (x, z) as z changes is often called reproducibility variation. Where each δ (x, j) is assumed to be constant in x (each operator using the gauge produces a linear measurement "system"), with say δ (x, j) ≡ δj, it is reasonably common to model the δj as random and iid with variance σ2δ, a so-called reproducibility variance. This parameter becomes something that one wants to learn about from data collection and analysis. Where it is discovered to be relatively large, remedial action usually focuses on training operators to take measurements "the same way" through improved protocols, fixturing devices, etc.

In a similar way, it is common for measurement organizations to run so-called round-robin studies involving a number of different laboratories, all measuring what is intended to be a standard specimen (with a common measurand) in order to evaluate lab-to-lab variability in measurement. In the event that one assumes that all of the labs j = 1, 2, . . . , J measure in such a way that δ (x, j) ≡ δj, it is variability among, or differences in, these lab biases that is of first interest in a round-robin study.
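The one-way random-effects structure just described can be simulated, and the reproducibility variance σ2δ estimated by standard ANOVA-based moment formulas. The following Python sketch uses invented parameter values and is only a toy version of the careful treatment promised for Section 5.6:

```python
import random
import statistics

# invented one-way random-effects setup: J operators, m repeats each
J, m = 8, 20
X = 10.0              # the single fixed measurand
SIGMA_DELTA = 0.30    # sd of the iid operator biases (reproducibility)
SIGMA = 0.10          # measurement sd (repeatability)

rng = random.Random(2)
rows = []
for _ in range(J):
    delta_j = rng.gauss(0.0, SIGMA_DELTA)
    rows.append([X + delta_j + rng.gauss(0.0, SIGMA) for _ in range(m)])

# one-way ANOVA moment estimator of the reproducibility variance
grand = statistics.mean(v for row in rows for v in row)
msb = m * sum((statistics.mean(row) - grand) ** 2 for row in rows) / (J - 1)
msw = statistics.mean(statistics.variance(row) for row in rows)
sigma2_delta_hat = max(0.0, (msb - msw) / m)
```

With only J = 8 operators, the estimate of σ2δ is quite variable from run to run, which foreshadows the study-design issues taken up later.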

2.2.4 Quantization/Digitalization/Rounding and Other Interval Censoring

For simplicity and without real loss of generality, suppose now that while measurement y is conceptually real-valued, it is only observed to the nearest integer. (If a measurement is observed to the kth decimal, then multiplication by 10^k produces an integer-valued observed measurement with units 10^−k times the original units.) In fact, let ⌊y⌋ stand for the integer-rounded version of y. The variable ⌊y⌋ is discrete and for integer i the model (1) for y implies that for measurand x

P [⌊y⌋ = i |x] = ∫_{i−.5}^{i+.5} f (y|x) dy (11)


and the model (2) implies that for measurand x and some identifiable and observed vector of variables z affecting the distribution of measurement error

P [⌊y⌋ = i |x, z] = ∫_{i−.5}^{i+.5} f (y|x, z) dy . (12)

(Limits of integration could be changed to i and i + 1 in situations where the rounding is "down" rather than to the nearest integer.)

Now, the bias and precision properties of the discrete variable ⌊y⌋ under the distribution specified by either (11) or (12) are not the same as those of the conceptually continuous variable y specified by (1) or (2). For example, in general for (11) and (1)

E [⌊y⌋ |x] ≠ µ (x) and Var [⌊y⌋ |x] ≠ σ2 (x) .

Sometimes the differences between the means and between the variances of the continuous and digital versions of y are small, but not always. The safest route to a rational analysis of quantized data is through the use of distributions specified by (11) or (12).

We remark (again) that a quite popular but technically incorrect (and potentially quite misleading) method of recognizing the difference between y and ⌊y⌋ is through what electrical engineers call "the quantization noise model." This treats the "quantization error"

q = ⌊y⌋ − y (13)

as a random variable uniformly distributed on (−.5, .5) and independent of y. This is simply an unrealistic description of q, which is a deterministic function of y (hardly independent of y!) and rarely uniformly distributed. (Some implications of these issues are discussed in elementary terms in Vardeman (2005). A more extensive treatment of the matter can be found in Burr et al (2011).)

It is worth noting that the notation (13) provides a conceptually appealing breakdown of a digital measurement as

⌊y⌋ = x + e + q .

With this notation, a digital measurement is thus the measurand plus a measurement error plus a digitalization error.

Expressions (11) or (12) have their natural generalizations to other forms of interval censoring/coarsening of a conceptually real-valued measurement y. If for a set of intervals {Ii} (finite or infinite in number and/or extent), when y falls into Ii one does not learn the value of y, but only that y ∈ Ii, appropriate probability modeling of what is observed replaces conditional probability density f (y|x) on R with conditional density f (y|x) on R\∪{Ii} and the set of conditional probabilities

P [y ∈ Ii |x] = ∫_{Ii} f (y|x) dy or P [y ∈ Ii |x, z] = ∫_{Ii} f (y|x, z) dy .


For example, if a measurement device can read out only measurements between a and b, values below a might be called "left-censored" and those to the right of b might be called "right-censored." Sensible probability modeling of measurement will be through a probability density, its values on (a, b), and its integrals from −∞ to a and from b to ∞.
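To see how different the discrete and continuous moments can be, one can compute the cell probabilities (11) under a normal model for y and sum. The following Python sketch (function names ours) does this numerically:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rounded_moments(mu, sigma):
    """Mean and variance of the integer-rounded measurement under a
    normal(mu, sigma^2) model for y, using the cell probabilities (11)."""
    lo = int(math.floor(mu - 8.0 * sigma))
    hi = int(math.ceil(mu + 8.0 * sigma))
    cells = [(i, norm_cdf((i + 0.5 - mu) / sigma)
                 - norm_cdf((i - 0.5 - mu) / sigma))
             for i in range(lo, hi + 1)]
    mean = sum(i * p for i, p in cells)
    var = sum((i - mean) ** 2 * p for i, p in cells)
    return mean, var
```

For µ = 0.25 and σ = 0.2, for example, the discrete mean is roughly 0.11 rather than 0.25; when σ is large relative to the quantization step, the discrete mean is essentially µ and the discrete variance is close to σ2 + 1/12 (Sheppard's correction).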

2.3 Low-Level Modeling of "Computed" Measurements

Consider now how probability can be used in the more detailed/low-level modeling of measurements derived as functions of several more basic quantities. In fact, suppose that a univariate measurement, y, is to be derived through the use of a measurement equation

y = m (θ,φ) , (14)

where m is a known function and empirical/statistical/measured values of the entries of θ = (θ1, θ2, . . . , θK) (say, θ̂ = (θ̂1, θ̂2, . . . , θ̂K)) are combined with externally-provided values of the entries of φ = (φ1, φ2, . . . , φL) to produce

y = m(θ̂, φ) . (15)

For illustration, consider determining the cross-sectional area, A, of a length, l, of copper wire by measuring its electrical resistance, R, at 20 ◦C. Physical theory says that the resistivity ρ of a material specimen with constant cross-section at a fixed temperature is

ρ = RA/l

so that

A = ρl/R . (16)

Using the value of ρ for copper available from a handbook and measuring values of l and R, it's clear that one can obtain a measured value of A through the use of the measurement equation (16). In this context, θ = (l, R) and φ = ρ.

Then, we further suppose that Type B uncertainties are provided for the elements of φ, say the elements of u = (u1, u2, . . . , uL). Finally, suppose that Type A standard errors are available for the elements of θ̂, say the elements of s = (s1, s2, . . . , sK). The GUM prescribes that a "standard uncertainty" associated with y in display (15) be computed as

u = √[ ∑_{i=1}^{K} (∂y/∂θi |_(θ̂,φ))² s2i + ∑_{i=1}^{L} (∂y/∂φi |_(θ̂,φ))² u2i ] . (17)
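Formula (17) is straightforward to implement with numerical partial derivatives. The sketch below is ours: it uses central-difference derivatives and invented (not handbook-exact) values for the copper-wire example.

```python
import math

def gum_standard_uncertainty(m, theta_hat, s, phi, u, h=1e-6):
    """First-order propagation in the spirit of (17), with partial
    derivatives of m taken numerically by central differences."""
    point = list(theta_hat) + list(phi)

    def partial(idx):
        lo, hi = list(point), list(point)
        lo[idx] -= h
        hi[idx] += h
        return (m(*hi) - m(*lo)) / (2.0 * h)

    K = len(theta_hat)
    total = sum(partial(i) ** 2 * si ** 2 for i, si in enumerate(s))
    total += sum(partial(K + i) ** 2 * ui ** 2 for i, ui in enumerate(u))
    return math.sqrt(total)

# copper-wire area A = rho*l/R of (16); values invented, not handbook-exact
area = lambda l, R, rho: rho * l / R
uA = gum_standard_uncertainty(area,
                              theta_hat=(1.0, 0.10), s=(0.001, 0.0005),  # l, R
                              phi=(1.68e-8,), u=(1e-10,))                # rho
```

Here the crude fixed step h is harmless because (16) is linear in l and ρ and smooth in R; a production implementation would scale the step to each argument.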

Formula (17) has the form of a "1st order delta-method approximation" to the standard deviation of a random quantity m (θ∗, φ∗) defined in terms of the function m and independent random vectors θ∗ and φ∗ whose mean vectors are respectively θ̂ and φ and whose covariance matrices are respectively diag(s21, s22, . . . , s2K) and diag(u21, u22, . . . , u2L). Note that the standard uncertainty (17) (involving as it does the Type B ui's) is a "Type B" quantity.

We have already said that the GUM has answered in the affirmative the implicit philosophical question of whether Type A and Type B uncertainties can legitimately be combined into a single quantitative assessment of uncertainty in a computed measurement, and (17) is one manifestation of this fact. It remains to produce a coherent statistical rationale for something similar to (15) and (17). The Bayesian statistical paradigm provides one such rationale. In this regard we offer the following.

Let w stand for data collected in the process of producing the measurement y, and f (w|θ) specify a probability model for w that depends upon θ as a (vector) parameter. Suppose that one then provides a "prior" (joint) probability distribution for θ and φ that is meant to summarize one's state of knowledge or ignorance about these vectors before collecting the data. We will assume that this distribution is specified by some joint "density"

g (θ,φ) .

(This could be a pdf, a probability mass function (pmf), or some hybrid, specifying a partially continuous and partially discrete joint distribution.) The product

f (w|θ) g (θ,φ) (18)

treated as a function of θ and φ (for the observed data w plugged in) is then proportional to a posterior (conditional on the data) joint density for θ and φ. This specifies what is (at least from a Bayesian perspective) a legitimate probability distribution for θ and φ. This posterior distribution in turn immediately leads (at least in theory) through equation (14) to a posterior probability distribution for m (θ,φ). Then, under suitable circumstances², forms (15) and (17) are potential approximations to the posterior mean and standard deviation of m (θ,φ). Of course, a far more direct route of analysis is to simply take the posterior mean (or median) as the measurement and the posterior standard deviation as the standard uncertainty.

As this exposition progresses we will provide some details as to how a full Bayesian analysis can be carried out for this kind of circumstance using the WinBUGS software to do Markov chain Monte Carlo simulation from the posterior distribution. Our strong preference for producing Type B uncertainties in this kind of situation is to use the full Bayesian methodology and posterior standard deviations in preference to the more ad hoc quantity (17).

² For example, if m is "not too non-linear," the ui are "not too big," the prior is one of independence of θ and φ, the prior for φ makes its components independent with means the externally prescribed values and variances u2i, the prior for θ is "flat," and estimated correlations between estimates of the elements of θ based on the likelihood f (w|θ) are small, then (15) and (17) will typically be decent approximations for respectively the posterior mean and standard deviation of m (θ,φ).
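As a crude illustration of the simulation-based alternative (plain Monte Carlo in Python rather than WinBUGS, with invented values for the copper-wire example), one can draw θ∗ and φ∗ from the normal distributions suggested by s and u and examine the induced distribution of m(θ∗, φ∗):

```python
import random
import statistics

area = lambda l, R, rho: rho * l / R   # measurement equation (16)

# invented estimates and uncertainties for the copper-wire example
l_hat, s_l = 1.0, 0.001                # length (m), Type A standard error
R_hat, s_R = 0.10, 0.0005              # resistance (ohm), Type A standard error
rho_star, u_rho = 1.68e-8, 1e-10       # resistivity (ohm*m), Type B

rng = random.Random(3)
draws = [area(rng.gauss(l_hat, s_l), rng.gauss(R_hat, s_R),
              rng.gauss(rho_star, u_rho))
         for _ in range(100_000)]
post_mean = statistics.mean(draws)
post_sd = statistics.stdev(draws)
```

Because (16) is nearly linear over these ranges, the resulting standard deviation essentially reproduces the value that (17) gives; with a strongly nonlinear m the two can differ, which is one argument for the simulation route.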


3 Simple Statistical Inference and Measurement (Type A Uncertainty Only)

While quite sophisticated statistical methods have their place in some measurement applications, many important issues in statistical inference and measurement can be illustrated using very simple methods. So rather than begin discussion with complicated statistical analyses, we begin by considering how basic statistical inference interacts with basic measurement. Some of the discussion of Sections 3.1, 3.2, 3.4, and 3.5 below is a more general and technical version of material that appeared in Vardeman et al (2010).

3.1 Frequentist and Bayesian Inference for a Single Mean

Probably the most basic method of statistical inference is the (frequentist) t confidence interval for a mean, which computed from observations w1, w2, . . . , wn with sample mean w̄ and sample standard deviation sw has endpoints

w̄ ± t (sw/√n) (19)

(where t is a small upper percentage point of the tn−1 distribution). These limits (19) are intended to bracket the mean of the data-generating mechanism that produces the wi. The probability model assumption supporting formula (19) is that the wi are iid N(µw, σ2w), and it is the parameter µw that is under discussion. Careful thinking about measurement and probability modeling reveals a number of possible real-world meanings for µw, depending upon the nature of the data collection plan and what can be assumed about the measuring method. Among the possibilities are at least

1. a measurand plus measurement bias, if the wi are measurements yi made repeatedly under fixed conditions for a single measurand,

2. an average measurand plus measurement bias, if the wi are measurements yi for n different measurands that are themselves drawn from a stable process, under the assumption of measurement method linearity (constant bias) and constant precision,

3. a measurand plus average bias, if the wi are measurements yi made for a single measurand, but with randomly varying and uncorrected-for vectors of variables zi that potentially affect bias, under the assumption of constant measurement precision,

4. a difference in measurement method biases (for two linear methods with potentially different but constant associated measurement precisions), if the wi are differences di = y1i − y2i between measurements made using methods 1 and 2 for n possibly different measurands.


Let us elaborate somewhat on contexts 1-4 based on the basics of Section 2.2, beginning with the simplest situation of context 1.

Where repeat measurements y1, y2, . . . , yn for measurand x are made by the same method under fixed physical conditions, the development of Section 2.2.1 is relevant. An iid model with marginal mean x + δ (x) (or x + δ (x, z0) for fixed z0) and marginal variance σ2 (x) (or σ2 (x, z0) for fixed z0) is plausible, and upon either making a normal distribution assumption or appealing to the widely known robustness properties of the t interval, limits (19) applied to the n measurements yi serve to estimate x + δ (x) (or x + δ (x, z0)).

Note that this first application of limits (19) provides a simple method of calibration. If measurand x is "known" because it corresponds to some sort of standard specimen, limits

ȳ ± t (sy/√n)

for x+ δ (x) correspond immediately to limits

(ȳ − x) ± t (sy/√n)

for the bias δ (x).

Consider then context 2. Where single measurements y1, y2, . . . , yn for measurands x1, x2, . . . , xn (modeled as generated in an iid fashion from a distribution with Ex = µx and Varx = σ2x) are made by a linear device with constant precision, the development of Section 2.2.2 is relevant. If measurement errors are modeled as iid with mean δ (the fixed bias) and variance σ2, display (7) gives the marginal mean and variance for the (iid) measurements, Ey = µx + δ and Vary = σ2x + σ2. Then again appealing to either normality or robustness, limits (19) applied to the n measurements yi serve to estimate µx + δ.

Next, consider context 3. As motivation for this case, one might take a situation where n operators each provide a single measurement of some geometric feature of a fixed metal part using a single gauge. It is probably too much to assume that these operators all use the gauge exactly the same way, and so an operator-dependent bias, say δ (x, zi), might be associated with operator i. Denote this bias as δi. Arguing as in the development of Section 2.2.3, one might suppose that measurement precision is constant and the δi are iid with Eδ = µδ and Varδ = σ2δ and independent of measurement errors. That in turn gives Ey = x + µδ and Vary = σ2δ + σ2, and limits (19) applied to n iid measurements yi serve to estimate x + µδ.

Finally consider context 4. If a single measurand produces y1 using Method 1 and y2 using Method 2, under an independence assumption for the pair (and providing the error distributions for both methods are unaffected by identifiable variables besides the measurand), the modeling of Section 2.2.1 prescribes that the difference be treated as having E(y1 − y2) = (x + δ1 (x)) − (x + δ2 (x)) = δ1 (x) − δ2 (x) and Var(y1 − y2) = σ21 (x) + σ22 (x). Then, if the two methods are linear with constant precisions, it is plausible to model the differences di = y1i − y2i as iid with mean δ1 − δ2 and variance σ21 + σ22, and limits (19) applied to the n differences serve to estimate δ1 − δ2 and thereby provide a comparison of the biases of the two measurement methods.

It is an important (but largely ignored) point that if one applies limits (19) to iid digital/quantized measurements ⌊y1⌋, ⌊y2⌋, . . . , ⌊yn⌋, one gets inferences for the mean of the discrete distribution of ⌊y⌋, and NOT for the mean of the conceptually continuous distribution of y. As we remarked in Section 2.2.4, these means may or may not be approximately the same. In Section 3.3 we will discuss methods of inference different from limits (19) that can always be used to employ quantized measurements in inference for the mean of the continuous distribution.

The basic case of inference for a normal mean represented by the limits (19) offers a simple context in which to make a first reasonably concrete illustration of Bayesian inference. That is, an alternative to the frequentist analysis leading to the t confidence limits (19) is this. Suppose that w1, w2, . . . , wn are modeled as iid N(µw, σ2w) and let

f(w|µw, σ2w)

be the corresponding joint pdf for w. If one subsequently adopts an "improper prior" (improper because it specifies a "distribution" with infinite mass, not a probability distribution) for (µw, σ2w) that is a product of a "Uniform(−∞,∞)" distribution for µw and a Uniform(−∞,∞) distribution for ln (σw), the corresponding "posterior distribution" (distribution conditional on data w) for µw produces probability intervals equivalent to the t intervals. That is, setting γw = ln (σw) one might define a "distribution" for (µw, γw) using a density on R2

g (µw, γw) ≡ 1

and a "joint density" for w and (µw, γw)

f (w|µw, exp (2γw)) · g (µw, γw) = f (w|µw, exp (2γw)) . (20)

Provided n ≥ 2, when treated as a function of (µw, γw) for observed values of wi plugged in, display (20) specifies a legitimate (conditional) probability distribution for (µw, γw). (This is despite the fact that it specifies infinite total mass for the joint distribution of w and (µw, γw).) The corresponding marginal posterior distribution for µw is that of w̄ + T sw/√n for T ∼ tn−1, which leads to posterior probability statements for µw operationally equivalent to frequentist confidence statements.

We will, in this exposition, use the WinBUGS software as a vehicle for enabling Bayesian computation and where appropriate provide some WinBUGS code. Here is some code for implementing the Bayesian analysis just outlined, demonstrated for an n = 5 case.

WinBUGS Code Set 1

#here is the model statement
model {
muw~dflat()
logsigmaw~dflat()
sigmaw<-exp(logsigmaw)
tauw<-exp(-2*logsigmaw)
for (i in 1:N) {
W[i]~dnorm(muw,tauw)
}
#WinBUGS parameterizes normal distributions with the
#second parameter inverse variances, not variances
}
#here are some hypothetical data
list(N=5,W=c(4,3,3,2,3))
#here is a possible initialization
list(muw=7,logsigmaw=2)

Notice that in the code above, w1 = 4, w2 = 3, w3 = 3, w4 = 2, w5 = 3 are implicitly treated as real numbers. (See Section 3.3 for an analysis that treats them as digital.) Computing with these values, one obtains w̄ = 3.0 and sw = √(1/2) ≈ 0.7071. So using t4 distribution percentage points, 95% confidence limits (19) for µw are

3.0 ± 2.776 (0.7071/√5) ,

i.e.,

2.12 and 3.88 .

These are also 95% posterior probability limits for µw based on structure (20). In this simple Bayesian analysis, the nature of the posterior distribution can be obtained either exactly from examination of the form (20), or in approximate terms through the use of WinBUGS simulations. As we progress to more complicated models and analyses, we will typically need to rely solely on the simulations.
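The little frequentist computation above is easy to reproduce outside WinBUGS; here is a Python check, with the t percentage point 2.776 simply hardcoded from tables:

```python
import math
import statistics

# the hypothetical n = 5 data from WinBUGS Code Set 1
w = [4, 3, 3, 2, 3]
n = len(w)
t4 = 2.776                      # upper 2.5% point of the t distribution, 4 df
wbar = statistics.mean(w)       # 3.0
sw = statistics.stdev(w)        # sqrt(1/2)
half = t4 * sw / math.sqrt(n)
limits = (round(wbar - half, 2), round(wbar + half, 2))
print(limits)                   # (2.12, 3.88)
```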

3.2 Frequentist and Bayesian Inference for a Single Standard Deviation

Analogous to the t limits (19) are χ2 confidence limits for a single standard deviation, which computed from observations w1, w2, . . . , wn with sample standard deviation sw are

sw √((n − 1)/χ2upper) and/or sw √((n − 1)/χ2lower) (21)

(for χ2upper and χ2lower respectively small upper and lower percentage points of the χ2n−1 distribution). These limits (21) are intended to bracket the standard deviation of the data-generating mechanism that produces the wi and are based on the model assumption that the wi are iid N(µw, σ2w). Of course, it is the parameter σw that is thus under consideration, and there are a number of possible real-world meanings for σw, depending upon the nature of the data collection plan and what can be assumed about the measuring method. Among these are at least

1. a measurement method standard deviation, if the wi are measurements yi made repeatedly under fixed conditions for a single measurand,

2. a combination of measurand and measurement method standard deviations, if the wi are measurements yi for n different measurands that are themselves independently drawn from a stable process, under the assumption of measurement method linearity and constant precision, and

3. a combination of measurement method and bias standard deviations, if the wi are measurements yi made for a single measurand, but with randomly varying and uncorrected-for vectors of variables zi that potentially affect bias, under the assumption of constant measurement precision.

Let us elaborate somewhat on contexts 1-3.

First, as for context 1 in Section 3.1, where repeat measurements y1, y2, . . . , yn for measurand x are made by the same method under fixed physical conditions, the development of Section 2.2.1 is relevant. An iid model with fixed marginal variance σ2 (x) (or σ2 (x, z0) for fixed z0) is plausible. Upon making a normal distribution assumption, the limits (21) then serve to estimate σ (x) (or σ (x, z0) for fixed z0) quantifying measurement precision.

For context 2 (as for the corresponding context in Section 3.1), where single measurements y1, y2, . . . , yn for measurands x1, x2, . . . , xn (modeled as generated in an iid fashion from a distribution with Varx = σ2x) are made by a linear device with constant precision, the development of Section 2.2.2 is relevant. If measurement errors are modeled as iid with mean δ (the fixed bias) and variance σ2, display (7) gives the marginal variance for the (iid) measurements, Vary = σ2x + σ2. Then under a normal distribution assumption, limits (21) applied to the n measurements yi serve to estimate √(σ2x + σ2).

Notice that in the event that one or the other of σx and σ can be taken as "known," confidence limits for √(σ2x + σ2) can be algebraically manipulated to produce confidence limits for the other standard deviation. If, for example, one treats measurement precision (i.e., σ2) as known and constant, limits (21) computed from single measurements y1, y2, . . . , yn for iid measurands correspond to limits

√max(0, s2((n − 1)/χ2upper) − σ2) and/or √max(0, s2((n − 1)/χ2lower) − σ2) (22)

for σx quantifying the variability of the measurands. In an engineering production context, these are limits for the "process standard deviation" uninflated by measurement noise.

Finally consider context 3. As for the corresponding context in Section 3.1, take as motivation a situation where n operators each provide a single measurement of some geometric feature of a fixed metal part using a single gauge, and an operator-dependent bias, say δ (x, zi), is associated with operator i. Abbreviating this bias to δi and again arguing as in the development of Section 2.2.3, one might suppose that measurement precision is constant and the δi are iid with Varδ = σ2δ. That (as before) gives Vary = σ2δ + σ2, and limits (21) applied to n iid measurements yi serve to estimate √(σ2δ + σ2).
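Limits (21), and the "known measurement precision" manipulation (22), can be sketched in a few lines of Python; the χ2 percentage points below are hardcoded table values for n = 10 (9 degrees of freedom), and the numerical inputs are invented:

```python
import math

# chi-square percentage points for 9 df (standard table values)
CHI2_UPPER, CHI2_LOWER = 19.023, 2.700   # 97.5% and 2.5% points

def sd_limits(s, n):
    """95% confidence limits (21) for the sd behind n iid normal observations."""
    return (s * math.sqrt((n - 1) / CHI2_UPPER),
            s * math.sqrt((n - 1) / CHI2_LOWER))

def process_sd_limits(s, n, sigma):
    """Limits (22) for sigma_x when the measurement sd sigma is treated as known."""
    lo, hi = sd_limits(s, n)
    return (math.sqrt(max(0.0, lo ** 2 - sigma ** 2)),
            math.sqrt(max(0.0, hi ** 2 - sigma ** 2)))
```

For example, s = 1.2 with n = 10 gives limits of roughly (0.83, 2.19) for √(σ2x + σ2), which shrink to roughly (0.66, 2.13) for σx itself when σ = 0.5 is taken as known.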

In industrial instances of this context (of multiple operators making single measurements on a fixed item) the variances σ2 and σ2δ are often called respectively repeatability and reproducibility variances. The first is typically thought of as intrinsic to the gauge or measurement method and the second as chargeable to consistent (and undesirable and potentially reducible) differences in how operators use the method or gauge. As suggested regarding context 2, where one or the other of σ and σδ can be treated as known from previous experience, algebraic manipulation of confidence limits for √(σ2δ + σ2) leads (in a way parallel to display (22)) to limits for the other standard deviation. We will later (in Section 5.6) consider the design and analysis of studies intended to at once provide inferences for both σ and σδ.

A second motivation for context 3 is one where there may be only a single operator, but there are n calibration periods involved, period i having its own period-dependent bias, δi. Calibration is never perfect, and one might again suppose that measurement precision is constant and the δi are iid with Varδ = σ2δ. In this scenario (common, for example, in metrology for nuclear safeguards) σ2 and σ2δ are again called respectively repeatability and reproducibility variances, but the meaning of the second of these is not variation of operator biases, but rather variation of calibration biases.

As a final topic in this section, consider Bayesian analyses that can produce inferences for a single standard deviation parallel to those represented by display (21). The Bayesian discussion of the previous section, and in particular the implications of the modeling represented by (20), focused on posterior probability statements for µw. As it turns out, this same modeling and computation produces a simple marginal posterior distribution for σw. That is, following from the joint "density" (20) is a conditional distribution for σ2w which is that of (n − 1) s2w/X for X a χ2n−1 random variable. That implies that the limits (21) are both frequentist confidence limits and also Bayesian posterior probability limits for σw. That is, for the improper independent uniform priors on µw and γw = ln (σw), Bayesian inference for σw is operationally equivalent to "ordinary" frequentist inference. Further, WinBUGS Code Set 1 in Section 3.1 serves to generate not only simulated values of µw, but also simulated values of σw from the joint posterior of these variables. (One needs only Code Set 1 to do Bayesian inference in this problem for any function of the pair (µw, σw), including the individual variables.)


3.3 Frequentist and Bayesian Inference From a Single Digital Sample

The previous two sections have considered inference associated with a single sample of measurements, assuming that these are conceptually real numbers. We now consider the related problem of inference assuming that measurements have been rounded/quantized/digitized. We will use the modeling ideas of Section 2.2.4 based on a normal model for underlying conceptually real-valued measurements y. That is, we suppose that y1, y2, . . . , yn are modeled as iid N(µ, σ²) (where depending upon the data collection plan and properties of the measurement method, the model parameters µ and σ can have any of the interpretations considered in the previous two Sections, 3.1 and 3.2). Available for analysis are then integer-valued versions of the yi, ⌊yi⌋. For f(·|µ, σ²) the univariate normal pdf, the likelihood function (of the two parameters) is

L(µ, σ²) = ∏_{i=1}^{n} ∫_{⌊yi⌋−.5}^{⌊yi⌋+.5} f(t|µ, σ²) dt   (23)

and the corresponding log-likelihood function for µ and γ = ln(σ) is

l(µ, γ) = ln L(µ, exp(2γ))   (24)

and is usually taken as the basis of frequentist inference.

One version of inference for the parameters based on the function (24) is roughly this. Provided that the range of the ⌊yi⌋ is at least 2, the function (24) is concave down, a maximizing pair (µ̂, γ̂) can serve as a joint maximum likelihood estimate of the vector, and the diagonal elements of the negative inverse Hessian for this function (evaluated at the maximizer) can serve as estimated variances for estimators µ̂ and γ̂, say σ̂²_µ̂ and σ̂²_γ̂. These in turn lead to approximate confidence limits for µ and γ of the form

µ̂ ± z σ̂_µ̂  and  γ̂ ± z σ̂_γ̂

(for z a small upper percentage point of the N(0, 1) distribution), and the second of these provides limits for σ² after multiplication by 2 and exponentiation.

This kind of analysis can be implemented easily in some standard statistical software suites that do "reliability" or "life data analysis" as follows. If y is normal, then exp(y) is "lognormal." The lognormal distribution is commonly used in life data analysis, and life data are often recognized to be "interval censored" and are properly analyzed as such. Upon entering the intervals

(exp(⌊yi⌋ − .5), exp(⌊yi⌋ + .5))

as observations and specifying a lognormal model in some life data analysis routines, essentially the above analysis will be done to produce inferences for µ and σ². For example, the user-friendly JMP statistical software (JMP, Version 9, SAS Institute, Cary, N.C., 1989-2011) will do this analysis.
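Outside of specialized life-data software, the same likelihood-based analysis can be sketched directly. The following is our illustration (not from the original exposition): it maximizes the rounded-data log-likelihood of displays (23) and (24) with scipy, for the hypothetical rounded values used with Code Set 2 below, and uses the optimizer's inverse-Hessian approximation for rough standard errors.

```python
import numpy as np
from scipy import optimize, stats

# hypothetical rounded ("digital") data, as used with WinBUGS Code Set 2
r = np.array([4.0, 3.0, 3.0, 2.0, 3.0])

def neg_log_lik(theta):
    # theta = (mu, gamma) with gamma = ln(sigma); rounded value r_i contributes
    # the normal probability of the interval (r_i - .5, r_i + .5), as in (23)
    mu, gamma = theta
    sigma = np.exp(gamma)
    p = stats.norm.cdf(r + 0.5, mu, sigma) - stats.norm.cdf(r - 0.5, mu, sigma)
    return -np.sum(np.log(p))

res = optimize.minimize(neg_log_lik, x0=[r.mean(), np.log(r.std())], method="BFGS")
mu_hat, gamma_hat = res.x
sigma_hat = np.exp(gamma_hat)

# BFGS's inverse-Hessian approximation stands in for the negative inverse
# Hessian of the log-likelihood; its diagonal gives rough estimated variances
se_mu, se_gamma = np.sqrt(np.diag(res.hess_inv))
z = stats.norm.ppf(0.975)
ci_mu = (mu_hat - z * se_mu, mu_hat + z * se_mu)
ci_sigma = (np.exp(gamma_hat - z * se_gamma), np.exp(gamma_hat + z * se_gamma))
print(mu_hat, sigma_hat, ci_mu, ci_sigma)
```

Note that for a sample this small the large-sample approximation is crude; the profile-likelihood intervals of Lee and Vardeman discussed below have better guaranteed coverage.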


A different frequentist approach to inference for µ and σ² in this model was taken by Lee and Vardeman (2001) and Lee and Vardeman (2002). Rather than appeal to large-sample theory for maximum likelihood estimation as above, they defined limits for the parameters based on profile log-likelihood functions and used extensive simulations to verify that their intervals have actual coverage properties at least as good as nominal even in small samples. The Lee-Vardeman (2001) intervals for µ are of the form

{µ | sup_γ l(µ, γ) > l(µ̂, γ̂) − c},   (25)

where if

c = (n/2) ln(1 + t²/(n − 1))

for t the upper α point of the tₙ₋₁ distribution, the interval (25) has corresponding actual confidence level at least (1 − 2α) × 100%. The corresponding intervals for σ are harder to describe in precise terms, but are of the same spirit, and the reader is referred to the 2002 paper for details.

A Bayesian approach to inference for (µ, σ²) in this digital data context is to replace f(data|µ, exp(2γ)) in (20) with L(µ, exp(2γ)) (for L defined in (23)) and, with improper uniform priors on µ and γ, use L(µ, exp(2γ)) to specify a joint posterior distribution for the mean and log standard deviation. No calculations with this can be done in closed form, but WinBUGS provides for very convenient simulation from this posterior. Appropriate example code is presented next.

WinBUGS Code Set 2

model {
  mu~dflat()
  logsigma~dflat()
  sigma<-exp(logsigma)
  tau<-exp(-2*logsigma)
  for (i in 1:N) {
    L[i]<-R[i]-.5
    U[i]<-R[i]+.5
  }
  for (i in 1:N) {
    Y[i]~dnorm(mu,tau) I(L[i],U[i])
  }
}
#here are the hypothetical data again
list(N=5,R=c(4,3,3,2,3))
#here is a possible initialization
list(mu=7,logsigma=2)


It is worth running both WinBUGS Code Set 1 (in Section 3.1) and WinBUGS Code Set 2 and comparing the approximate posteriors they produce. The posteriors for the mean are not very different. But there is a noticeable difference in the posteriors for the standard deviations. In a manner consistent with well-established statistical folklore, the posterior for Code Set 2 suggests a smaller standard deviation than does that for Code Set 1. It is widely recognized that ignoring the effects of quantization/digitization typically tends to inflate one's perception of the standard deviation of an underlying conceptually continuous distribution. This is clearly an important issue in modern digital measurement.
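A quick simulation (ours, purely for illustration) makes the folklore concrete: rounding normal data to integers inflates the apparent standard deviation, by roughly Sheppard's 1/12 addition to the variance when σ is not large relative to the rounding grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# conceptually real-valued measurements and their integer (digital) versions
sigma = 0.5
y = rng.normal(10.0, sigma, size=100_000)
y_rounded = np.round(y)

s_raw = y.std(ddof=1)
s_rounded = y_rounded.std(ddof=1)
# naive use of the rounded values overstates sigma: Var(rounded) is
# approximately Var(y) + 1/12 (Sheppard's correction)
print(round(s_raw, 3), round(s_rounded, 3))
```

With σ = 0.5 and a unit rounding grid, the naive standard deviation of the rounded values is near √(0.25 + 1/12) ≈ 0.577 rather than 0.5.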

3.4 Frequentist and Bayesian "Two-Sample" Inference for a Difference in Means

An important elementary statistical method for making comparisons is the frequentist t confidence interval for a difference in means. The Satterthwaite version of this interval, computed from w11, w12, . . . , w1n1 with sample mean w̄1 and sample standard deviation s1 and w21, w22, . . . , w2n2 with sample mean w̄2 and sample standard deviation s2, has endpoints

w̄1 − w̄2 ± t̂ √(s1²/n1 + s2²/n2)   (26)

(for t̂ an upper percentage point from the t distribution with the data-dependent "Satterthwaite approximate degrees of freedom" or, conservatively, degrees of freedom min(n1, n2) − 1). These limits (26) are intended to bracket the difference in the means of the two data-generating mechanisms that produce respectively the w1i and the w2i. The probability model assumptions that support formula (26) are that all of the w's are independent, the w1i iid N(µ1, σ1²) and the w2i iid N(µ2, σ2²), and it is the difference µ1 − µ2 that is being estimated.
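For reference, the computation in display (26) is elementary. Here is a sketch of ours in Python (using scipy), applied to the hypothetical two-sample data that appear with WinBUGS Code Set 3 below.

```python
import numpy as np
from scipy import stats

def satterthwaite_ci(w1, w2, conf=0.95):
    """Confidence interval (26) for a difference in means, using the
    data-dependent Satterthwaite approximate degrees of freedom."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    n1, n2 = len(w1), len(w2)
    v1, v2 = w1.var(ddof=1) / n1, w2.var(ddof=1) / n2
    # Satterthwaite approximate degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    t_hat = stats.t.ppf(1 - (1 - conf) / 2, df)
    diff = w1.mean() - w2.mean()
    half = t_hat * np.sqrt(v1 + v2)
    return diff - half, diff + half

# the hypothetical data used with WinBUGS Code Set 3
lo, hi = satterthwaite_ci([4, 3, 3, 2, 3], [7, 8, 4, 5])
print(lo, hi)
```

Replacing the approximate degrees of freedom with the conservative min(n1, n2) − 1 only widens the interval slightly here.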

Depending upon the data collection plan employed and what may be assumed about the measurement method(s), there are a number of possible practical meanings for the difference µ1 − µ2. Among them are

1. a difference in two biases, if the w1i and w2i are measurements y1i and y2i of a single measurand, made repeatedly under fixed conditions using two different methods,

2. a difference in two biases, if the w1i and w2i are single measurements y1i and y2i for n1 + n2 measurands drawn from a stable process, made using two different methods under fixed conditions and the assumption that both methods are linear,

3. a difference in two measurands, if the w1i and w2i are repeat measurements y1i and y2i made on two measurands using a single linear method, and


4. a difference in two mean measurands, if the w1i and w2i are single measurements y1i and y2i made on n1 and n2 measurands from two stable processes made using a single linear method.

Let us elaborate on contexts 1-4 based on the basics of Section 2.2, beginning with context 1.

Where repeat measurements y11, y12, . . . , y1n1 for measurand x are made by Method 1 under fixed conditions and repeat measurements y21, y22, . . . , y2n2 of this same measurand are made under these same fixed conditions by Method 2, the development of Section 2.2.1 may be employed twice. An iid model with marginal mean x + δ1(x) (or x + δ1(x, z0) for fixed z0) and marginal variance σ1²(x) (or σ1²(x, z0) for this z0) independent of an iid model with marginal mean x + δ2(x) (or x + δ2(x, z0) for this z0) and marginal variance σ2²(x) (or σ2²(x, z0) for this z0) is plausible for the two samples of measurements. Normal assumptions or robustness properties of the method (26) then imply that the Satterthwaite t interval limits applied to the n1 + n2 measurements serve to estimate δ1(x) − δ2(x) (or δ1(x, z0) − δ2(x, z0)). Notice that under the assumption that both devices are linear, these limits provide some information regarding how much higher or lower the first method reads than the second for any x.

In context 2, the development of Section 2.2.2 is potentially relevant to the modeling of y11, y12, . . . , y1n1 and y21, y22, . . . , y2n2 that are single measurements on n1 + n2 measurands drawn from a stable process, made using two different methods under fixed conditions. An iid model with marginal mean µx + Eδ1(x) (or µx + Eδ1(x, z0) for fixed z0) and marginal variance σ1² ≡ Eσ1²(x) independent of an iid model with marginal mean µx + Eδ2(x) (or µx + Eδ2(x, z0) for fixed z0) and marginal variance σ2² ≡ Eσ2²(x) might be used if δ1(x) and δ2(x) (or δ1(x, z0) and δ2(x, z0)) are both constant in x, i.e., both methods are linear. Then normal assumptions or robustness considerations imply that method (26) can be used to estimate δ1 − δ2 (where δj ≡ δj(x) or δj(x, z0)).

In context 3, where y11, y12, . . . , y1n1 and y21, y22, . . . , y2n2 are repeat measurements made on two measurands using a single method under fixed conditions, the development of Section 2.2.1 may again be employed. An iid model with marginal mean x1 + δ(x1) (or x1 + δ(x1, z0) for fixed z0) and marginal variance σ²(x1) (or σ²(x1, z0) for this z0) independent of an iid model with marginal mean x2 + δ(x2) (or x2 + δ(x2, z0) for this z0) and marginal variance σ²(x2) (or σ²(x2, z0) for this z0) is plausible for the two samples of measurements. Normal assumptions or robustness properties of the method (26) then imply that the Satterthwaite t interval limits applied to the n1 + n2 measurements serve to estimate x1 − x2 + (δ(x1) − δ(x2)) (or x1 − x2 + (δ(x1, z0) − δ(x2, z0))). Then, if the device is linear, one has a way to estimate x1 − x2.

Finally, in context 4, where y11, y12, . . . , y1n1 and y21, y22, . . . , y2n2 are single measurements made on n1 and n2 measurands from two stable processes made using a single linear device, the development of Section 2.2.2 may again be employed. An iid model with marginal mean µ1x + δ and marginal variance σ1² ≡ E1σ²(x) independent of an iid model with marginal mean µ2x + δ and marginal variance σ2² ≡ E2σ²(x) might be used. Then normal assumptions or robustness considerations imply that method (26) can be used to estimate µ1x − µ2x.

An alternative to the analysis leading to Satterthwaite confidence limits (26) is this. To the frequentist model assumptions supporting (26), one adds prior assumptions that take all of µ1, µ2, ln(σ1), and ln(σ2) to be Uniform(−∞,∞) and independent. Provided n1 and n2 are both at least 2, the formal "posterior distribution" of the four model parameters is proper (is a real probability distribution) and posterior intervals for µ1 − µ2 can be obtained by simulating from the posterior. Here is some code implementing the Bayesian analysis, illustrated for an example calculation with n1 = 5 and n2 = 4.

WinBUGS Code Set 3

#here is the model statement
model {
  muw1~dflat()
  logsigmaw1~dflat()
  sigmaw1<-exp(logsigmaw1)
  tauw1<-exp(-2*logsigmaw1)
  for (i in 1:N1) {
    W1[i]~dnorm(muw1,tauw1)
  }
  muw2~dflat()
  logsigmaw2~dflat()
  sigmaw2<-exp(logsigmaw2)
  tauw2<-exp(-2*logsigmaw2)
  for (j in 1:N2) {
    W2[j]~dnorm(muw2,tauw2)
  }
  mudiff<-muw1-muw2
  sigmaratio<-sigmaw1/sigmaw2
}
#here are some hypothetical data
list(N1=5,W1=c(4,3,3,2,3),N2=4,W2=c(7,8,4,5))
#here is a possible initialization
list(muw1=6,logsigmaw1=2,muw2=8,logsigmaw2=3)

WinBUGS Code Set 3 is an obvious modification of WinBUGS Code Set 1 (of Section 3.1). It is worth noting that a similar modification of WinBUGS Code Set 2 (of Section 3.3) provides an operationally straightforward inference method for µ1 − µ2 in the case of two digital samples. We note as well, in anticipation of the next section, that WinBUGS Code Set 3 contains a specification of the ratio σ1/σ2 that enables Bayesian two-sample inference for a ratio of normal standard deviations.


3.5 Frequentist and Bayesian "Two-Sample" Inference for a Ratio of Standard Deviations

Another important elementary statistical method for making comparisons is the frequentist F confidence interval for the ratio of standard deviations for two normal distributions. This interval, computed from w11, w12, . . . , w1n1 with sample standard deviation s1 and w21, w22, . . . , w2n2 with sample standard deviation s2, has endpoints

s1 / (s2 √F_{n1−1,n2−1,upper})  and  s1 / (s2 √F_{n1−1,n2−1,lower})   (27)

(for F_{n1−1,n2−1,upper} and F_{n1−1,n2−1,lower}, respectively, small upper and lower percentage points of the F_{n1−1,n2−1} distribution). These limits (27) are intended to bracket the ratio of standard deviations for two (normal) data-generating mechanisms that produce respectively the w1i and the w2i. The probability model assumptions that support formula (27) are that all of the w's are independent, the w1i iid N(µ1, σ1²) and the w2i iid N(µ2, σ2²), and it is the ratio σ1/σ2 that is being estimated.

Depending upon the data collection plan employed and what may be assumed about the measurement method(s), there are a number of possible practical meanings for the ratio σ1/σ2. Among them are

1. the ratio of two device standard deviations, if the w1i and w2i are measurements y1i and y2i of a single measurand per device made repeatedly under fixed conditions,

2. the ratio of two combinations of device and (the same) measurand standard deviations, if the w1i and w2i are single measurements y1i and y2i for n1 + n2 measurands drawn from a stable process, made using two different methods under fixed conditions and the assumption that both methods are linear and have constant precision, and

3. the ratio of two combinations of (the same) device and measurand standard deviations, if the w1i and w2i are single measurements y1i and y2i made on n1 and n2 measurands from two stable processes made using a single linear method with constant precision.
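As with the t interval, the F-based limits (27) are simple to compute. The sketch below is ours (using scipy), applied again to the hypothetical two-sample data of Code Set 3.

```python
import numpy as np
from scipy import stats

def sd_ratio_ci(w1, w2, conf=0.95):
    """F-based confidence limits (27) for sigma1/sigma2 under normal models."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    s1, s2 = w1.std(ddof=1), w2.std(ddof=1)
    n1, n2 = len(w1), len(w2)
    alpha = 1 - conf
    f_lower = stats.f.ppf(alpha / 2, n1 - 1, n2 - 1)
    f_upper = stats.f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)
    # the lower limit uses the upper F point and vice versa, as in (27)
    return s1 / (s2 * np.sqrt(f_upper)), s1 / (s2 * np.sqrt(f_lower))

lo, hi = sd_ratio_ci([4, 3, 3, 2, 3], [7, 8, 4, 5])
print(lo, hi)
```

The great width of the resulting interval for these tiny samples illustrates how little information about a ratio of standard deviations is carried by a few observations per group.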

Let us elaborate slightly on contexts 1-3, beginning with context 1.

Where repeat measurements y11, y12, . . . , y1n1 for measurand x1 are made by Method 1 under fixed conditions and repeat measurements y21, y22, . . . , y2n2 of a (possibly different) fixed measurand x2 are made under these same fixed conditions by Method 2, the development of Section 2.2.1 may be employed. An iid normal model with marginal variance σ1²(x1) (or σ1²(x1, z0) for this z0) independent of an iid normal model with marginal variance σ2²(x2) (or σ2²(x2, z0) for this z0) is potentially relevant for the two samples of measurements. Then, if measurement precision for each device is constant in x or the two measurands are the same, limits (27) serve to produce a comparison of device precisions.


In context 2, the development in Section 2.2.2 and in particular displays (6) and (9) shows that under normal distribution and independence assumptions, device linearity and constant precision imply that limits (27) serve to produce a comparison of √(σx² + σ1²) and √(σx² + σ2²), which obviously provides only an indirect comparison of measurement precisions σ1 and σ2.

Finally, in context 3, the development in Section 2.2.2 and in particular displays (6) and (9) shows that under normal distribution and independence assumptions, device linearity and constant precision imply that limits (27) serve to produce a comparison of √(σ1x² + σ²) and √(σ2x² + σ²), which again provides only an indirect comparison of process standard deviations σ1x and σ2x.

We then recall that the WinBUGS Code Set 3 used in the previous section to provide an alternative to limits (26) also provides an alternative to limits (27). Further, modification of WinBUGS Code Set 2 for the case of two samples allows comparison of σ1x and σ2x based on digital data.

3.6 Section Summary Comments

It should now be clear to the reader, on the basis of consideration of what are really fairly simple probability modeling notions of Section 2.2 and the most elementary methods of standard statistical inference, that

1. how sources of physical variation interact with a data collection plan determines what can be learned in a statistical study, and in particular how measurement error is reflected in the resulting data,

2. even the most elementary statistical methods have their practical effectiveness limited by measurement variation, and

3. even the most elementary statistical methods are helpful in quantifying the impact of measurement variation.

In addition, it should be clear that Bayesian methodology (particularly as implemented using WinBUGS) is a simple and powerful tool for handling inference problems in models that include components intended to reflect measurement error.

4 Bayesian Computation of Type B Uncertainties

Consider now the program for the Bayesian computation of Type B uncertainties outlined in Section 2.3, based on the model indicated in display (18).

As a very simple example, consider the elementary physics experiment of determining a spring constant. Hooke's law says that over some range of weights (not so large as to permanently deform the spring and not too small), the magnitude of the change in length ∆l of a steel spring when a weight of mass M is hung from it is

k∆l = Mg


for a spring constant, k, peculiar to the spring, so that

k = Mg / ∆l.   (28)

One version of a standard freshman physics laboratory exercise is as follows. For several different physical weights, initial and stretched spring lengths are measured, and the implied values of k computed via (28). (These are then somehow combined to produce a single value for k and some kind of uncertainty value.) Sadly, there is often only a single determination made with each weight, eliminating the possibility of directly assessing a Type A uncertainty for the lengths, but here we will consider the possibility that several "runs" are made with each physical weight.

In fact, suppose that r different weights are used, weight i with nominal value of Mg equal to φi and standard uncertainty ui = 0.01φi for its value of Mg. Then suppose that weight i is used ni times and length changes

∆lij for j = 1, 2, . . . , ni

(these each coming from the difference of two measured lengths) are produced. A possible model here is that actual values of Mg for the weights are independent random variables,

wi ∼ N(φi, (0.01φi)²)

and the length changes ∆lij are independent random variables,

∆lij ∼ N(wi/k, σ²)

(this following from the kind of considerations in Section 2.2.1 if the length measuring device is linear with constant precision and the variability of length change doesn't vary with the magnitude of the change). Then placing independent U(−∞,∞) improper prior distributions on k and ln σ, one can use WinBUGS to simulate from the posterior distribution of (k, σ, w1, w2, . . . , wr), the k-marginal of which provides both a measurement for k (the mean or median) and a standard uncertainty (the standard deviation).

The WinBUGS code below can be used to implement the above analysis, demonstrated here for a data set where r = 10 and each ni = 4.

WinBUGS Code Set 4

#here is the model statement
model {
  k~dflat()
  logsigma~dflat()
  for (i in 1:r) {
    tau[i]<-1/(phi[i]*phi[i]*(.0001))
  }
  for (i in 1:r) {
    w[i]~dnorm(phi[i],tau[i])
  }
  sigma<-exp(logsigma)
  taudeltaL<-exp(-2*logsigma)
  for (i in 1:r) {
    mu[i]<-w[i]/k
  }
  for (i in 1:r) {
    for (j in 1:m) {
      deltaL[i,j]~dnorm(mu[i],taudeltaL)
    }
  }
}
#here are some example data taken from an online
#Baylor University sample physics lab report found 6/3/10 at
#http://www.baylor.edu/content/services/document.php/110769.pdf
#lengths are in cm and weights(forces) are in Newtons
list(r=10,m=4,
phi=c(.0294,.0491,.0981,.196,.392,.589,.785,.981,1.18,1.37),
deltaL=structure(.Data=c(.94,1.05,1.02,1.03,
1.60,1.68,1.67,1.69,
3.13,3.35,3.34,3.35,
6.65,6.66,6.65,6.62,
13.23,13.33,13.25,13.32,
19.90,19.95,19.94,19.95,
26.58,26.58,26.60,26.60,
33.23,33.25,33.23,33.23,
39.92,39.85,39.83,39.83,
46.48,46.50,46.45,46.43),
.Dim=c(10,4)))
#here is a possible initialization
list(w=c(.0294,.0491,.0981,.196,.392,.589,.785,.981,1.18,1.37),
k=2.95,logsigma=-3)

It is potentially of interest that running Code Set 4 produces a posterior mean for k around 2.954 N/m and a corresponding posterior standard deviation for k around 0.010 N/m. These values are in reasonable agreement with a conventional (but somewhat ad hoc) analysis in the original data source. The present analysis has the virtue of providing a completely coherent integration of the Type B uncertainty information provided for the magnitudes of the weights employed in the lab with the completely empirical measurements of average length changes (possessing their own Type A uncertainty) to produce a rational overall Type B standard uncertainty for k.

The obvious price to be paid before one is able to employ the kind of Bayesian analysis illustrated here in the assessment of Type B uncertainty is the development of non-trivial facility with probability modeling and some familiarity with modern Bayesian computation. The modeling task implicit in display (18) is not trivial. But there is really no simple-minded substitute for the methodical, technically-informed clear thinking required to do the modeling, if a correct probability-based assessment of uncertainty is desired.
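To convey the flavor of the propagation without WinBUGS, here is a rough Monte Carlo sketch of ours: it propagates only the 1% Type B weight uncertainties through a least-squares-through-the-origin estimate of k (ignoring the length-measurement Type A component that the full Bayesian analysis also integrates), using the Code Set 4 data.

```python
import numpy as np

rng = np.random.default_rng(0)

# nominal weight magnitudes Mg (Newtons) and length changes (cm) from Code Set 4
phi = np.array([.0294, .0491, .0981, .196, .392, .589, .785, .981, 1.18, 1.37])
dl = np.array([
    [ .94, 1.05, 1.02, 1.03], [ 1.60, 1.68, 1.67, 1.69],
    [ 3.13, 3.35, 3.34, 3.35], [ 6.65, 6.66, 6.65, 6.62],
    [13.23, 13.33, 13.25, 13.32], [19.90, 19.95, 19.94, 19.95],
    [26.58, 26.58, 26.60, 26.60], [33.23, 33.25, 33.23, 33.23],
    [39.92, 39.85, 39.83, 39.83], [46.48, 46.50, 46.45, 46.43],
])
dbar = dl.mean(axis=1)

# propagate the Type B weight uncertainties (u_i = 0.01*phi_i) by Monte Carlo:
# draw plausible true weights w_i, re-estimate k each time by least squares
# through the origin, and summarize the resulting spread in k
draws = []
for _ in range(5000):
    w = rng.normal(phi, 0.01 * phi)                # plausible true Mg values
    k_cm = (w * dbar).sum() / (dbar ** 2).sum()    # k in N/cm
    draws.append(100.0 * k_cm)                     # convert to N/m
k_mean, k_sd = np.mean(draws), np.std(draws)
print(round(k_mean, 3), round(k_sd, 3))
```

The resulting mean is close to the posterior mean quoted above; the spread is of the same order as (though not identical to) the full posterior standard deviation, since this sketch handles only one of the two uncertainty sources.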

5 Some Intermediate Statistical Inference Methods and Measurement

5.1 Regression Analysis

Regression analysis is a standard intermediate-level statistical methodology. It is relevant to measurement in several ways. Here we consider its possible uses in the improvement of a measurement system through calibration, and through the correction of measurements for the effects of an identifiable and observed vector of variables z other than the measurand.

5.1.1 Simple Linear Regression and Calibration

Consider the situation where (through the use of standards) one essentially "knows" the values of several measurands and can thus hope to compare measured values from one device to those measurands. (And to begin, we consider the case where there is no identifiable set of extraneous/additional variables believed to affect the distribution of measurement error whose effects one hopes to model.)

A statistical regression model for an observable y and unobserved function µ(x) (potentially applied to measurement y and measurand x) is

y = µ(x) + ε   (29)

for a mean 0 error random variable, ε, often taken to be normal with standard deviation σ that is independent of x. µ(x) is often assumed to be of a fairly simple form depending upon some parameter vector β. The simplest common version of this model is the "simple linear regression" model where

µ(x) = β0 + β1x.   (30)

Under the simple linear regression model for measurement y and measurand x, β1 = 1 means that the device being modeled is linear and δ = β0 is the device bias. If β1 ≠ 1, then the device is non-linear in metrology terminology (even though the mean measurement is a linear function of the measurand).

Of course, rearranging relationship (30) slightly produces

x = (µ(x) − β0) / β1.

This suggests that if one knows with certainty the constants β0 and β1, a (linearly calibrated) improvement on the measurement y (one that largely eliminates bias in a measurement y) is

x* = (y − β0) / β1.   (31)


This line of reasoning even makes sense where y is not in the same units as x (and thus is not directly an approximation for the measurand). Take for example a case where temperature x is to be measured by evaluating the resistivity y of some material at that temperature. In that situation, a raw y is not a temperature, but rather a resistance. If average resistance is a linear function of temperature, then form (31) represents the proper conversion of resistance to temperature (the constant β1 having the units of resistance divided by those of temperature).

So consider then the analysis of a calibration experiment in which for i = 1, 2, . . . , n the value xi is a known value of a measurand and, corresponding to xi, a fixed measuring device produces a value yi (that might be, but need not be, in the same units as x). Suppose that the simple linear regression model

yi = β0 + β1xi + εi   (32)

for the εi iid normal with mean 0 and standard deviation σ is assumed to be appropriate. Then the theory of linear models implies that for b0 and b1 the least squares estimators of β0 and β1, respectively, and

sSLR = √( ∑_{i=1}^{n} (yi − (b0 + b1xi))² / (n − 2) ),

confidence limits for β1 are

b1 ± t sSLR / √( ∑_{i=1}^{n} (xi − x̄)² )   (33)

(for t here and throughout this section a small upper percentage point of the tₙ₋₂ distribution) and confidence limits for µ(x) = β0 + β1x for any value of x (including x = 0 and thus the case of β0) are

(b0 + b1x) ± t sSLR √( 1/n + (x − x̄)² / ∑_{i=1}^{n} (xi − x̄)² ).   (34)

Limits (33) and (34) provide some sense as to how much information the (xi, yi) data from the calibration experiment provide concerning the constants β0 and β1, and indirectly how well the transformation

x** = (y − b0) / b1

can be expected to do at approximating the transformation (31) giving the calibrated version of y.

What is more directly relevant to the problem of providing properly calibrated measurements with attached appropriate (Type A) uncertainty figures than the confidence limits (33) and (34) are statistical prediction limits for a new or (n+1)st y corresponding to a measurand x. These are well known to be

(b0 + b1x) ± t sSLR √( 1 + 1/n + (x − x̄)² / ∑_{i=1}^{n} (xi − x̄)² ).   (35)

What then is less well known (but true) is that if one assumes that for an unknown x, an (n+1)st observation ynew is generated according to the same basic simple linear regression model (32), a confidence set for the (unknown) x is

{x | the set of limits (35) bracket ynew}.   (36)

Typically, the confidence set (36) is an interval that contains

x**new = (ynew − b0) / b1

that itself serves as a single (calibrated) version of ynew and thus provides Type A uncertainty for x**new. (The rare cases in which this is not true are those where the calibration experiment leaves large uncertainty about the sign of β1, in which case the confidence set (36) may be of the form (−∞, #) ∪ (##, ∞) for two numbers # < ##.) Typically then, both an approximately calibrated version of a measurement and an associated indication (in terms of confidence limits for x) of uncertainty can be read from a plot of a least squares line and prediction limits (35) for all x, much as suggested in Figure 1.
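The least squares fit, the point calibration x**, and the inversion of the prediction limits (35) into the confidence set (36) can be sketched numerically as follows (our illustration, using the Cr6+ absorbance-versus-concentration calibration data that appear with WinBUGS Code Set 5 and ynew = 0.2).

```python
import numpy as np
from scipy import stats

# calibration data from Code Set 5: known Cr6+ concentrations x (mg/l)
# and measured absorbances y
x = np.array([0.0, 1.0, 2.0, 4.0, 6.0, 8.0])
y = np.array([.002, .078, .163, .297, .464, .600])
n = len(x)
y_new = 0.2

# least squares fit of the simple linear regression model (32)
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
s_slr = np.sqrt(((y - (b0 + b1 * x)) ** 2).sum() / (n - 2))
t = stats.t.ppf(0.975, n - 2)

# point-calibrated measurement x** = (y_new - b0)/b1
x_star = (y_new - b0) / b1

# confidence set (36): all x whose prediction limits (35) bracket y_new,
# found here by brute-force search over a grid
grid = np.linspace(-2.0, 12.0, 100_001)
sxx = ((x - x.mean()) ** 2).sum()
half = t * s_slr * np.sqrt(1 + 1 / n + (grid - x.mean()) ** 2 / sxx)
inside = np.abs(y_new - (b0 + b1 * grid)) <= half
x_lo, x_hi = grid[inside].min(), grid[inside].max()
print(x_star, (x_lo, x_hi))
```

For these data the confidence set is indeed a single interval containing x**, consistent with the graphical reading from Figure 1.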

Figure 1: Plot of Data from a Calibration Experiment, Least Squares Line, and Prediction Limits for ynew.

A Bayesian analysis of a calibration experiment under model (32) producing results like the frequentist ones is easy enough to implement, but not without philosophical subtleties that have been a source of controversy over the years.


One may treat x1, x2, . . . , xn as known constants, model y1, y2, . . . , yn as independent normal variables with means β0 + β1xi and standard deviation σ, and then suppose that for fixed/known ynew, a corresponding xnew is normal with mean (ynew − β0)/β1 and standard deviation σ/|β1|.³ Then upon placing independent U(−∞,∞) improper prior distributions on β0, β1, and ln σ, one can use WinBUGS to simulate from the joint posterior distribution of the parameters β0, β1, and σ, and the unobserved xnew. The xnew-marginal then provides a calibrated measurement associated with the observation ynew and a corresponding uncertainty (in, respectively, the posterior mean and standard deviation of this marginal).

The WinBUGS code below can be used to implement the above analysis for an n = 6 data set taken from a (no longer available) web page of the School of Chemistry at the University of Witwatersrand developed by D.G. Billing. Measured absorbance values, y, for solutions with "known" Cr6+ concentrations, x (in mg/l), used in the code are, in fact, the source of Figure 1.

WinBUGS Code Set 5

model {
  beta0~dflat()
  beta1~dflat()
  logsigma~dflat()
  for (i in 1:n) {
    mu[i]<-beta0+(beta1*x[i])
  }
  sigma<-exp(logsigma)
  tau<-exp(-2*logsigma)
  for (i in 1:n) {
    y[i]~dnorm(mu[i],tau)
  }
  munew<-(ynew-beta0)/beta1
  taunew<-(beta1*beta1)*tau
  xnew~dnorm(munew,taunew)
}
#here are the calibration data
list(n=6,ynew=.2,x=c(0,1,2,4,6,8),
y=c(.002,.078,.163,.297,.464,.600))
#here is a possible initialization
list(beta0=0,beta1=.1,logsigma=-4,xnew=3)

The reader can verify that the kind of uncertainty in xnew indicated in Figure 1 is completely consistent with what is indicated using the WinBUGS Bayesian technology.

³This modeling associated with (xnew, ynew) is perhaps not the most natural that could be proposed. But it turns out that what initially seem like potentially more appropriate assumptions lead to posteriors with some quite unattractive properties.


5.1.2 Regression Analysis and Correction of Measurements for the Effects of Extraneous Variables

Another potential use of regression analysis in a measurement context is in the development of a correction of a measurement y for the effects of an identifiable and observed vector of variables z other than the measurand believed to affect the error distribution. If measurements of known measurands, x, can be obtained for a variety of vectors z, regression analysis can potentially be used to find formulas for appropriate corrections.

Suppose that one observes measurements y1, y2, . . . , yn of measurands x1, x2, . . . , xn under observed conditions z1, z2, . . . , zn. Under the assumptions that

1. for any fixed z the measurement device is linear (any measurement bias does not depend upon x, but only potentially upon z) and

2. the precision of the measurement device is a function of neither the measurand nor z (the standard deviation of y for fixed x and z is not a function of these),

it potentially makes sense to consider regression models of the form

(yi − xi) = β0 + ∑_{l=1}^{p} βl zil + εi.   (37)

The mean (of (yi − xi)) in this statistical model, β0 + ∑_{l=1}^{p} βl zil, is δ(zi), the measurement bias. Upon application of some form of fitting for model (37) to produce, say,

δ̂(zi) = b0 + ∑_{l=1}^{p} bl zil,   (38)

a possible approximately-bias-corrected/calibrated version of a measurement y made under conditions z becomes

y* = y − δ̂(z).

Ordinary multiple linear regression analysis can be used to fit model (37) to produce the estimated bias (38). Bayesian methodology can also be used. One simply uses independent U(−∞,∞) improper prior distributions for β0, β1, . . . , βp and ln σ, employs WinBUGS to simulate from the joint posterior distribution of the parameters β0, β1, . . . , βp and σ, and uses posterior means for the regression coefficients as point estimates b0, b1, . . . , bp.
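A minimal sketch of the ordinary least squares route to the correction is below. It is our construction: the measurands x, the conditions z, and the bias coefficients are hypothetical values chosen purely for illustration of the fitting of model (37) and the use of (38).

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical calibration-style data: measurements y of known measurands x
# whose bias depends linearly on two extraneous variables (p = 2)
n = 200
x = rng.uniform(0, 10, n)                # known measurand values
z = rng.uniform(-1, 1, (n, 2))           # observed extraneous conditions
true_beta = np.array([0.5, 0.2, -0.1])   # assumed bias structure (illustrative)
y = x + true_beta[0] + z @ true_beta[1:] + rng.normal(0, 0.05, n)

# ordinary least squares fit of (y - x) on (1, z1, z2): the estimated
# bias function (38)
Z = np.column_stack([np.ones(n), z])
b, *_ = np.linalg.lstsq(Z, y - x, rcond=None)

def corrected(y_obs, z_obs):
    """Approximately bias-corrected measurement y* = y - delta_hat(z)."""
    return y_obs - (b[0] + b[1] * z_obs[0] + b[2] * z_obs[1])

print(b)  # estimated (b0, b1, b2)
```

With a data set this size and small measurement noise, the fitted coefficients recover the assumed bias structure closely, and corrected measurements made under any observed z are approximately free of the z-dependent bias.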

5.2 Shewhart Charts for Monitoring Measurement Device Stability

Shewhart control charts are commonly used for monitoring production processes for the purpose of change detection. (See, for example, their extensive treatment in books such as Vardeman and Jobe (1999) and Vardeman and Jobe (2001).) That is, they are commonly used to warn that something unexpected has occurred and that an industrial process is no longer operating in a standard manner. These simple tools are equally useful for monitoring the performance of a measurement device over time. That is, for a fixed measurand x (associated with some physically stable specimen) that can be measured repeatedly with a particular device, suppose that initially the device produces measurements, y, with mean µ(x) and standard deviation σ(x) (that, of course, must usually be estimated through the processing of n measurements y1, y2, . . . , yn into, most commonly, a sample mean ȳ and sample standard deviation s). In what follows, we will take these as determined with enough precision that they are essentially "known."

Then suppose that periodically (say at time i = 1, 2, . . .), one remeasures x and obtains m values y that are processed into a sample mean ȳi and sample standard deviation si. Shewhart control charts are plots of ȳi versus i (the Shewhart "x̄" chart) and si versus i (the Shewhart "s" chart) augmented with "control limits" that separate values of ȳi or si plausible under a "no change in the measuring device" model from ones implausible under such a model. Points plotting inside the control limits are treated as lack of definitive evidence of a change in the measurement device, while ones plotting outside of those limits are treated as indicative of a change in measurement. Of course, the virtue to be had in plotting the points is the possibility it provides of seeing trends potentially providing early warning of (or interpretable patterns in) measurement change.

Standard practice for Shewhart charting of means is to rely upon at least approximate normality of ȳi under the "no change in measurement" model and set lower and upper control limits at respectively

LCLy = µ (x)− 3σ (x)√m

and UCLy = µ (x) + 3σ (x)√m

. (39)

Something close to standard practice for Shewhart charting of standard deviations is to note that normality for measurements y implies that under the "no change in measurement" model, (m − 1)s²/σ²(x) has a χ² distribution with m − 1 degrees of freedom, and to thus set lower and upper control limits for si at

√(σ²(x)χ²lower/(m − 1))  and  √(σ²(x)χ²upper/(m − 1))     (40)

for χ²lower and χ²upper, respectively, small lower and upper percentage points for the χ² distribution with m − 1 degrees of freedom.

Plotting of ȳi with limits (39) is a way of monitoring basic device calibration. A change detected by this "x̄" chart suggests that the mean measurement, and therefore the measurement bias, has drifted over time. Plotting of si with limits (40) is a way of guarding against unknown change in basic measurement precision. In particular, si plotting above an upper control limit is evidence of degradation in a device's precision.

There are many other potential applications of Shewhart control charts to metrology. Any statistic that summarizes performance of a process and is newly computed based on current process data at regular intervals might possibly be usefully control charted. So, for example, in situations where a device is regularly recalibrated using the simple linear regression of Section 5.1.1, though one expects the fitted values b0, b1, and sSLR to vary period-to-period, appropriate Shewhart charting of these quantities could be used to alert one to an unexpected change in the pattern of calibrations (potentially interpretable as a fundamental change in the device).
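To make the computation of limits (39) and (40) concrete, here is a small Python sketch. It is our illustration and not part of the original development: the function names and the numerical inputs in the usage note are hypothetical, the 0.005/0.995 percentage points are an arbitrary illustrative choice, and a Wilson-Hilferty approximation to chi-square quantiles is used so that only the standard library is required.

```python
from math import sqrt
from statistics import NormalDist

def chi2_quantile(p, df):
    # Wilson-Hilferty approximation to the chi-square quantile function
    z = NormalDist().inv_cdf(p)
    return df * (1.0 - 2.0 / (9.0 * df) + z * sqrt(2.0 / (9.0 * df))) ** 3

def shewhart_limits(mu_x, sigma_x, m, alpha=0.005):
    """Limits for monitoring a measurement device: x-bar limits from
    display (39) and s limits from display (40)."""
    # display (39): mu(x) +/- 3 sigma(x)/sqrt(m)
    lcl_ybar = mu_x - 3.0 * sigma_x / sqrt(m)
    ucl_ybar = mu_x + 3.0 * sigma_x / sqrt(m)
    # display (40): sqrt(sigma^2(x) * chi-square percentage point / (m - 1))
    lcl_s = sqrt(sigma_x ** 2 * chi2_quantile(alpha, m - 1) / (m - 1))
    ucl_s = sqrt(sigma_x ** 2 * chi2_quantile(1.0 - alpha, m - 1) / (m - 1))
    return (lcl_ybar, ucl_ybar), (lcl_s, ucl_s)
```

With, say, µ(x) = 10.0, σ(x) = 0.2, and m = 5, values of ȳi or si falling outside the returned limits would be treated as evidence of a change in the device.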

5.3 Use of Two Samples to Separate Measurand Standard Deviation and Measurement Standard Deviation

A common problem in engineering applications of statistics is the estimation of a process standard deviation. As is suggested in Section 2.2.2 and in particular display (7), n measurements y1, y2, . . . , yn of different measurands x1, x2, . . . , xn themselves drawn at random from a fixed population or process with standard deviation σx will vary more than the measurands alone because of measurement error. We consider here using two (different kinds of) samples to isolate the process variation, in a context where a linear device has precision that is constant in x and remeasurement of the same specimen is possible.

So suppose that y1, y2, . . . , yn are as just described and that m additional measurements y′1, y′2, . . . , y′m are made for the same (unknown) measurand, x. Under the modeling of Sections 2.2.1 and 2.2.2 and the assumptions of linearity and constant precision, for s²y the sample variance of y1, y2, . . . , yn and s²y′ the sample variance of y′1, y′2, . . . , y′m, the first of these estimates σ²x + σ² and the second estimates σ², suggesting

σ̂x = √(max(0, s²y − s²y′))

as an estimate of σx. The so-called "Satterthwaite approximation" (that essentially treats σ̂x² as if it were a multiple of a χ² distributed variable and estimates both the degrees of freedom and the value of the multiplier) then leads to approximate confidence limits for σx of the form

σ̂x √(ν̂/χ²upper)  and/or  σ̂x √(ν̂/χ²lower)     (41)

for

ν̂ = σ̂x⁴ / (s⁴y/(n − 1) + s⁴y′/(m − 1))

and χ²upper and χ²lower percentage points for the χ² distribution with ν̂ degrees of freedom. This method is approximate at best. A more defensible way of doing inference for σx is through a simple Bayesian analysis.

That is, it can be appropriate to model y1, y2, . . . , yn as iid normal variables with mean µy and variance σ²x + σ², independent of y′1, y′2, . . . , y′m modeled as iid normal variables with mean µy′ and variance σ². Then using (independent improper) U(−∞,∞) priors for all of µy, µy′, ln σx, and ln σ, one might use WinBUGS to find posterior distributions, with focus on the marginal posterior of σx in this case. Some example code for a case with n = 10 and m = 7 is next. (The data are real measurements of the sizes of some binder clips made with a vernier micrometer. Units are mm above 32.00 mm.)

WinBUGS Code Set 6

model {
  muy~dflat()
  logsigmax~dflat()
  sigmax<-exp(logsigmax)
  sigmasqx<-exp(2*logsigmax)
  muyprime~dflat()
  logsigma~dflat()
  sigma<-exp(logsigma)
  sigmasq<-exp(2*logsigma)
  tau<-exp(-2*logsigma)
  sigmasqy<-sigmasqx+sigmasq
  tauy<-1/sigmasqy
  for (i in 1:n) {
    y[i]~dnorm(muy,tauy)
  }
  for (i in 1:m) {
    yprime[i]~dnorm(muyprime,tau)
  }
}

#here are some real data
list(n=10,
     y=c(-.052,-.020,-.082,-.015,-.082,-.050,-.047,-.038,-.083,-.030),
     m=7,
     yprime=c(-.047,-.049,-.052,-.050,-.052,-.057,-.047))

#here is a possible initialization
list(muy=-.05,logsigmax=-3.6,muyprime=-.05,logsigma=-5.6)

It is perhaps worth checking that, at least for the case illustrated in Code Set 6 (where σ appears to be fairly small in comparison to σx), the Bayes 95% credible interval for the process/measurand standard deviation σx is in substantial agreement with nominally 95% limits (41).
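A non-Bayesian reader can carry out the Satterthwaite computation of this section directly on the binder-clip data of Code Set 6. The Python sketch below is our illustration (the function names are hypothetical, and a Wilson-Hilferty approximation to chi-square quantiles is used so that only the standard library is required); it implements the point estimate, the ν̂ formula, and limits (41).

```python
from math import sqrt
from statistics import NormalDist, variance

def chi2_quantile(p, df):
    # Wilson-Hilferty approximation to the chi-square quantile function
    z = NormalDist().inv_cdf(p)
    return df * (1.0 - 2.0 / (9.0 * df) + z * sqrt(2.0 / (9.0 * df))) ** 3

def satterthwaite_sigma_x(y, yprime, conf=0.95):
    """Estimate sigma_x from s_y^2 (estimating sigma_x^2 + sigma^2) and
    s_y'^2 (estimating sigma^2), with approximate limits (41)."""
    n, m = len(y), len(yprime)
    s2y, s2yp = variance(y), variance(yprime)
    sigma_x_hat = sqrt(max(0.0, s2y - s2yp))
    # Satterthwaite's estimated degrees of freedom
    nu_hat = sigma_x_hat ** 4 / (s2y ** 2 / (n - 1) + s2yp ** 2 / (m - 1))
    a = (1.0 - conf) / 2.0
    lower = sigma_x_hat * sqrt(nu_hat / chi2_quantile(1.0 - a, nu_hat))
    upper = sigma_x_hat * sqrt(nu_hat / chi2_quantile(a, nu_hat))
    return sigma_x_hat, nu_hat, (lower, upper)

# binder-clip data from Code Set 6 (mm above 32.00 mm)
y = [-.052, -.020, -.082, -.015, -.082, -.050, -.047, -.038, -.083, -.030]
yprime = [-.047, -.049, -.052, -.050, -.052, -.057, -.047]
sigma_x_hat, nu_hat, limits = satterthwaite_sigma_x(y, yprime)
```

For these data σ̂x is roughly 0.025 mm with ν̂ between 8 and 9, so the nominal 95% interval is reasonably tight despite the small samples.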

5.4 One-Way Random Effects Analyses and Measurement

A standard model of intermediate-level applied statistics is the so-called "one-way random effects model." This, for

wij = the jth observation in an ith sample of ni observations

employs the assumption that

wij = µ+ αi + εij (42)

for µ an unknown parameter, α1, α2, . . . , αI iid normal random variables with mean 0 and standard deviation σα, independent of ε11, ε12, . . . , ε1n1, ε21, . . . , εI−1,nI−1, εI1, . . . , εInI that are iid normal with mean 0 and standard deviation σ. Standard statistical software can be used to do frequentist inference for the parameters of this model (µ, σα, and σ), and Code Set 7 below illustrates the fact that a corresponding Bayesian analysis is easily implemented in WinBUGS. In that code, we have specified independent improper U(−∞,∞) priors for µ and ln σ, but have used a (proper) "inverse-gamma" prior for σ²α. It turns out that in random effects models, U(−∞,∞) priors for log standard deviations (except that of the ε's) fail to produce a legitimate posterior. In the present context, for reasonably large I (say I ≥ 8) an improper U(0,∞) prior can be effectively used for σα. Gelman (2006) discusses this issue in detail.

WinBUGS Code Set 7

model {
  MU~dflat()
  logsigma~dflat()
  sigma<-exp(logsigma)
  tau<-exp(-2*logsigma)
  taualpha~dgamma(.001,.001)
  sigmaalpha<-1/sqrt(taualpha)
  for (i in 1:I) {
    mu[i]~dnorm(MU,taualpha)
  }
  for (k in 1:n) {
    w[k]~dnorm(mu[sample[k]],tau)
  }
}

#The data here are Argon sensitivities for a mass spectrometer
#produced on 3 different days, taken from a 2006 Quality
#Engineering paper of Vardeman, Wendelberger and Wang
list(n=44,
     w=c(31.3,31.0,29.4,29.2,29.0,28.8,28.8,27.7,27.7,27.8,28.2,28.4,
         28.7,29.7,30.8,30.1,29.9,32.5,32.2,31.9,30.2,30.2,29.5,30.8,
         30.5,28.4,28.5,28.8,28.8,30.6,31.0,31.7,29.8,29.6,29.0,28.8,
         29.6,28.9,28.3,28.3,28.3,29.2,29.7,31.1),
     sample=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
              2,2,2,2,2,2,2,2,2,2,2,2,2,2,
              3,3,3,3,3,3,3,3,3,3,3,3,3))

#here is a possible initialization
list(MU=30,logsigma=0,taualpha=1,mu=c(30,30,30))
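For readers without WinBUGS at hand, the classical method-of-moments (ANOVA) estimates of σ and σα under model (42) are easy to compute, even for unbalanced data like the Argon sensitivities above. The following Python sketch is our frequentist illustration (with hypothetical function names), not a replacement for the Bayesian analysis of Code Set 7; the data are split into the three days.

```python
from statistics import mean

def one_way_mom(groups):
    """Method-of-moments (ANOVA) estimates of (sigma, sigma_alpha) under
    the one-way random effects model (42); groups is a list of lists of
    observations, one inner list per level i (unbalanced data allowed)."""
    I = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(x for g in groups for x in g) / N
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    mse = ss_within / (N - I)          # estimates sigma^2
    msb = ss_between / (I - 1)
    # with unbalanced data, E(MSB) = sigma^2 + c * sigma_alpha^2 for this c
    c = (N - sum(len(g) ** 2 for g in groups) / N) / (I - 1)
    sigma2_alpha = max(0.0, (msb - mse) / c)
    return mse ** 0.5, sigma2_alpha ** 0.5

# Argon sensitivity data from Code Set 7, split by day
day1 = [31.3, 31.0, 29.4, 29.2, 29.0, 28.8, 28.8, 27.7, 27.7, 27.8,
        28.2, 28.4, 28.7, 29.7, 30.8, 30.1, 29.9]
day2 = [32.5, 32.2, 31.9, 30.2, 30.2, 29.5, 30.8, 30.5, 28.4, 28.5,
        28.8, 28.8, 30.6, 31.0]
day3 = [31.7, 29.8, 29.6, 29.0, 28.8, 29.6, 28.9, 28.3, 28.3, 28.3,
        29.2, 29.7, 31.1]
sigma_hat, sigma_alpha_hat = one_way_mom([day1, day2, day3])
```

For these data the day-to-day component σ̂α comes out noticeably smaller than the within-day component σ̂.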

There are at least two important engineering measurement contexts where the general statistical model (42) and corresponding data analyses are relevant:

1. a single measurement device is used to produce measurements of multiple measurands drawn from a stable process, multiple times apiece, or

2. one measurand is measured multiple times using each one of a sample of measurement devices drawn from a large family of such devices.

The second of these contexts is common in engineering applications where only one physical device is used, but data groups correspond to multiple operators using that device. Let us elaborate a bit on these two contexts and the corresponding use of the one-way random effects model.

In context 1, take

yij = the jth measurement on item i drawn from a fixed

population of items or process producing items .

If, as in Section 2.2.2,

xi = the measurand for the ith item

and the xi are modeled as iid with mean µx and variance σ²x, linearity and constant precision of the measurement device then make it plausible to model these measurands as independent of iid measurement errors yij − xi that have mean (bias) δ and variance σ². Then with the identifications

µ = µx + δ, αi = xi − µx, and εij = (yij − xi)− δ ,

wij in display (42) is yij, and adding normal distribution assumptions produces an instance of the one-way normal random effects model with σα = σx. Then frequentist or Bayesian analyses for the one-way model applied to the yij produce ways (more standard than that developed in Section 5.3) of separating process variation from measurement variation and estimating the contribution of each to overall variability in the measurements.

In context 2, take

yij = the jth measurement on a single item made

using randomly selected device or method i .

Then as in Section 2.2.3 suppose that (for the single measurand under consideration) bias for method/device i is δi. It can be appropriate to model the δi as iid with mean µδ and variance σ²δ. Then with the identifications

µ = x+ µδ, αi = δi − µδ, and εij = (yij − x)− δi ,

wij in display (42) is yij, and adding normal distribution assumptions produces an instance of the one-way normal random effects model with σα = σδ. Then frequentist or Bayesian analyses for the one-way model applied to the yij produce a way of separating σδ from σ and estimating the contribution of each to overall variability in the measurements. We note once more that, in the version of this where what changes device-to-device is only the operator using a piece of equipment, σ is often called a "repeatability" standard deviation and σδ, measuring as it does operator-to-operator variation, is usually called a "reproducibility" standard deviation.

A final observation here comes from the formal similarity of the application of one-way methods to scenarios 1 and 2, and the fact that Section 5.3 provides a simple treatment for scenario 1. Upon proper reinterpretation, it must then also provide a simple treatment for scenario 2, and in particular for separating repeatability and reproducibility variation. That is, where a single item is measured once each by n different operators and then m times by a single operator, the methodologies of Section 5.3 could be used to estimate (not σx but rather) σδ and σ, providing the simplest possible introduction to the engineering topic of "gauge R&R."

5.5 A Generalization of Standard One-Way Random Effects Analyses and Measurement

The model (42), employing as it does a common standard deviation σ across all indices i, could potentially be generalized by assuming that the standard deviation of εij is σi (where σ1, σ2, . . . , σI are potentially different). With effects αi random as in Section 5.4, one then has a model with parameters µ, σα, σ1, σ2, . . . , and σI. This is not a standard model of intermediate statistics, but handling inference under it in a Bayesian fashion is really not much harder than handling inference under the usual one-way random effects model. Bayesian analysis (using independent improper U(−∞,∞) priors for all of µ, ln σ1, . . . , ln σI and a suitable proper prior for σα per Gelman (2006)) is easily implemented in WinBUGS by making suitable small modifications of Code Set 7.

A metrology context where this generalized one-way random effects model

is useful is that of a round robin study, where the same measurand is measured at several laboratories with the goals of establishing a consensus value for the (unknown) measurand and lab-specific assessments of measurement precision. For

wij = yij = the jth of ni measurements made at lab i ,

one might take µ as the ideal (unknown) consensus value, σα a measure of lab-to-lab variation in measurement, and each σi as a measure of lab i's precision. (Note that if the measurand were known, the lab biases µ + αi − x would be most naturally treated as unknown model parameters.)
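As a purely descriptive complement to such a Bayesian round-robin analysis, one can compute per-lab summary statistics and a simple precision-weighted consensus mean (in the spirit of the classical Graybill-Deal estimator, weighting each lab mean by ni/s²i). The sketch below is our illustration with made-up lab names and data, not a method advocated in the text.

```python
from statistics import mean, stdev

def round_robin_summary(labs):
    """Per-lab sample means/standard deviations and a precision-weighted
    consensus mean (weights n_i / s_i^2, Graybill-Deal style)."""
    stats = {lab: (mean(y), stdev(y), len(y)) for lab, y in labs.items()}
    weights = {lab: n / s ** 2 for lab, (mn, s, n) in stats.items()}
    consensus = (sum(w * stats[lab][0] for lab, w in weights.items())
                 / sum(weights.values()))
    return stats, consensus

# hypothetical round-robin data: one measurand, three labs
labs = {"lab1": [10.1, 10.2, 10.0],
        "lab2": [10.5, 10.4, 10.6],
        "lab3": [10.2, 10.3, 10.2]}
stats, consensus = round_robin_summary(labs)
```

Such summaries are only a starting point; unlike the Bayesian analysis sketched above, the weighted consensus mean takes no account of the lab-to-lab component σα.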

5.6 Two-Way Random Effects Analyses and Measurement

Several models of intermediate-level applied statistics concern random effects and observations naturally thought of as comprising a two-way arrangement of samples. (One might envision samples laid out in the cells of some two-way row-by-column table.) For

wijk = the kth observation in a sample of nij observations in the

ith row and jth column of a two-way structure ,

we consider random effects models that recognize the two-way structure of the samples. (Calling rows levels of some factor, A, and columns levels of some factor, B, it is common to talk about A and B effects on the cell means.) This can be represented through assumptions that

wijk = µ+ αi + βj + εijk (43)

or

wijk = µ+ αi + βj + αβij + εijk (44)

under several possible interpretations of the αi, βj, and potentially the αβij (µ is always treated as an unknown parameter, and the εijk are usually taken to be iid N(0, σ²)). Standard "random effects" assumptions on the terms in (43) or (44) are that the effects of a given type are iid random draws from some mean-0 normal distribution. Standard "fixed effects" assumptions on the terms in (43) or (44) are that effects of a given type are unknown parameters.

A particularly important standard application of two-way random effects

analyses in measurement system capability assessment is to gauge R&R studies. In the most common form of such studies, each of I different items/parts is measured several times by each of J different operators/technicians with a fixed gauge, as a way of assessing measurement precision obtainable using the gauge. With parts on rows and operators on columns, the resulting data can be thought of as having two-way structure. (It is common practice to make all nij the same, though nothing really requires this or even that all nij > 0, except for the availability of simple formulas and/or software for frequentist analysis.) So we will here consider the implications of forms (43) and (44) in a case where

wijk = yijk = the kth measurement of part i by operator j .

Note that for all versions of the two-way model applied to gauge R&R studies, the standard deviation σ quantifies variability in measurement of a given part by a given operator, the kind of measurement variation called "repeatability" variation in Sections 3.2 and 5.4.

5.6.1 Two-Way Models Without Interactions and Gauge R&R Studies

To begin, (either for all αi and βj fixed effects or conditioning on the values of random row and column effects) model (43) says that the mean measurement on item i by operator j is

µ+ αi + βj . (45)

Now, if one assumes that each measurement "device" consisting of operator j using the gauge being studied (in a standard manner, under environmental conditions that are common across all measurements, etc.) is linear, then for some operator-specific bias δj the mean measurement on item i by operator j is

xi + δj (46)

(for xi the ith measurand).

Treating the column effects in display (45) as random and averaging produces

an average row i cell mean (under the two-way model with no interactions)

µ+ αi

and treating the operator biases in display (46) as random with mean µδ produces a row i mean operator average measurement

xi + µδ .

Then combining these, it follows that applying a random-operator-effects two-way model (43) under an operator linearity assumption, one is effectively assuming that

αi = xi + µδ − µ

and thus that differences in αi's are equally differences in measurands xi (and so, if the row effects are assumed to be random, σα = σx).

In a completely parallel fashion, applying a random-part-effects two-way

model (43) under an operator linearity assumption, one is assuming that

βj = δj + µx − µ

and thus that differences in βj's are equally differences in operator-specific biases δj (and so, if column effects are assumed to be random, σβ = σδ, a measure of "reproducibility" variation in the language of Sections 2.2.3 and 5.4).

5.6.2 Two-Way Models With Interactions and Gauge R&R Studies

The relationship (46) of Section 5.6.1, appropriate when operator-gauge "devices" are linear, implies that the no-interaction form (45) is adequate to represent any set of measurands and operator biases. So if the more complicated relationship

µ+ αi + βj + αβij (47)

for the mean measurement on item i by operator j (or conditional mean in the event that row, column, or cell effects are random) from model (44) is required to adequately model measurements in a gauge R&R study, the interactions αβij must quantify non-linearity of operator-gauge devices.

Consider then gauge R&R application of the version of the two-way model

where all of row, column, and interaction effects are random. σαβ (describing as it does the distribution of the interactions) is a measure of overall non-linearity, and inference based on the model (44) that indicates that this parameter is small would be evidence that the operator-gauge devices are essentially linear.


Lacking linearity, a reasonable question is what should be called "reproducibility" variation in cases where the full complexity of the form (47) is required to adequately model gauge R&R data. An answer to this question can be made by drawing a parallel to the one-way development of Section 5.4. Fixing attention on a single part, one has (conditional on that part) exactly the kind of data structure discussed in Section 5.4. The group-to-group standard deviation (σα in the notation of the previous section) quantifying variation in group means is, in the present modeling,

√(σ²β + σ²αβ) ,

quantifying the variation in the sums βj + αβij (which are what change cell mean to cell mean in form (47) as one moves across cells in a fixed row of a two-way table). It is thus this parametric function that might be called σreproducibility in an application of the two-way random effects model with interactions to gauge R&R data.

Further, it is sensible to define the parametric function

σR&R = √(σ² + σ²reproducibility) = √(σ²β + σ²αβ + σ²) ,

the standard deviation associated by the two-way random effects model with interactions with (for a single fixed part) a single measurement made by a randomly chosen operator. With this notation, the parametric function σR&R and its components σrepeatability = σ and σreproducibility are of interest to practitioners.
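These parametric functions combine variance components in a simple way. As a small Python illustration (ours; the function name and the numerical component values are made up for the example):

```python
from math import sqrt

def gauge_rr(sigma, sigma_beta, sigma_alphabeta):
    """Repeatability, reproducibility, and combined R&R standard
    deviations from two-way random effects components (Section 5.6.2)."""
    sigma_repeatability = sigma
    sigma_reproducibility = sqrt(sigma_beta ** 2 + sigma_alphabeta ** 2)
    sigma_RR = sqrt(sigma ** 2 + sigma_reproducibility ** 2)
    return sigma_repeatability, sigma_reproducibility, sigma_RR

# made-up component values for illustration
rep, repro, rr = gauge_rr(sigma=1.2, sigma_beta=0.3, sigma_alphabeta=0.4)
```

In practice the components σ, σβ, and σαβ would of course first be estimated (by frequentist or Bayesian means) from gauge R&R data.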

5.6.3 Analyses of Gauge R&R Data

Two-way random effects analogues of the Bayesian analyses presented before in this exposition are more or less obvious. Independent improper U(−∞,∞) priors for all fixed effects and the log standard deviation of the ε's, appropriate proper priors for the other standard deviations, and use of WinBUGS to find posterior distributions for quantities of interest (for example, σR&R from Section 5.6.2) are possible and effective. For more details and some corresponding WinBUGS code, see Weaver et al (2011).

6 Concluding Remarks

The intention here has been to outline the relevance of statistics to metrology for physical science and engineering in general terms. The particular statistical methods and applications to metrology introduced in Sections 3, 4, and 5 barely scratch the surface of what is possible or needed. This area has huge potential for both stimulating important statistical research and providing real contributions to the conduct of science and engineering. In conclusion here, we mention (with even less detail than we have provided in what has gone before) some additional opportunities for statistical collaboration and contribution in this area.


6.1 Other Measurement Types

Our discussion in this article has been carried primarily by reference to univariate and essentially real-valued measurements and simple statistical methods for them. But these are appropriate in only the simplest of the measurement contexts met in modern science and engineering. Sound statistical handling of measurement variability in other data types is also needed, and in many cases new modeling and inference methodology is needed to support this.

In one direction, new work is needed in appropriate statistical treatment of measurement variability in the deceptively simple-appearing case of univariate ordinal "measurements" (and even categorical "measurements"). de Mast and van Wieringen (2010) is an important recent effort in this area and makes some connections to related literature in psychometrics.

In another direction, modern physical measurement devices increasingly produce highly multivariate, essentially real-valued measurements. Sometimes the individual coordinates of these concern fundamentally different physical properties of a specimen, so that their indexing is more or less arbitrary and appropriate statistical methodology needs to be invariant to reordering of the coordinates. In such cases, classical statistical multivariate analysis is potentially helpful. But the large sample sizes needed to reliably estimate large mean vectors and covariance matrices do not make widespread successful metrology applications of textbook multivariate analysis seem likely.

On the other hand, there are many interesting modern technological multivariate measurement problems where substantial physical structure gives natural subject matter meaning to coordinate indexing. In these cases, probability modeling assumptions that tie successive coordinates of multivariate data together in appropriate patterns can make realistic sample sizes workable for statistical analysis. One such example is the measurement of weight fractions (of a total specimen weight) of particles of a granular material falling into a set of successive size categories. See Leyva et al (2011) for a recent treatment of this problem that addresses measurement errors. Another class of problems of this type concerns the analysis of measurements that are effectively (discretely sampled versions of) "functions" of some variable, t (that could, for example, be time, or wavelength, or force, or temperature, etc.).

Specialized statistical methodology for these applications needs to be developed in close collaboration with technologists and metrologists almost on a case-by-case basis, every new class of measuring machine and experiment calling for advances in modeling and inference. Problems seemingly as simple as the measurement of 3-d position and orientation (essential to fields from manufacturing to biomechanics to materials science) involve multivariate data and very interesting statistical considerations, especially where measurement error is to be taken into account. For example, measurement of a 3-d location using a coordinate measuring machine has error that is both location-dependent and (user-chosen) probe-path-dependent (matters that require careful characterization and modeling for effective statistical analysis). 3-d orientations are most directly described by orthogonal matrices with positive determinant, and require non-standard probability models and inference methods for the handling of measurement error. (See, for example, Bingham, Vardeman, and Nordman (2009) for an indication of what is needed and can be done in this context.)

Finally, much of modern experimental activity in a number of fields (including drug screening, "combinatorial chemistry," and materials science) follows a general pattern called "high throughput screening," in which huge numbers of experimental treatments are evaluated simultaneously so as to "discover" one or a few that merit intensive follow-up. Common elements of these experiments include a very large number of measurements, and complicated or (relative to traditional experiments) unusual blocking structures corresponding to, for example, microarray plates and patterns among robotically produced measurements. In this context, serious attention is needed to data modeling and analysis that adequately accounts for relationships between measurement errors and/or other sources of random noise, often regarded as "nuisance effects."

6.2 Measurement and Statistical Experimental Design and Analysis

Consideration of what data have the most potential for improving understanding of physical systems is a basic activity of statistics. This is the sub-area of statistical design and analysis of experiments. We have said little about metrology and statistically informed planning of data collection in this article. But there are several ways that this statistical expertise can be applied (and further developed) in collaborations with metrologists.

In the first place, in developing a new measurement technology, it is desirable

to learn how to make it insensitive to all factors except the value of a measurand. Experimentation is a primary way of learning how that can be done, and statistical experimental design principles and methods can be an important aid in making that effort efficient and effective. Factorial and fractional factorial experimentation and associated data analysis methods can be employed in "ruggedness testing" to see if environmental variables (in addition to a measurand) impact the output of a measurement system. In the event that there are such variables, steps may be taken to mitigate their effects. Beyond the possible simple declaration to users that the offending variables need to be held at some standard values, there is the possibility of re-engineering the measurement method in a way that makes it insensitive to the variables. Sometimes logic and physical principles provide immediate guidance in either insulating the measurement system from changes in the variables or eliminating their impact. Other times experimentation may be needed to find a configuration of measurement system parameters in which the variables are no longer important, and again statistical design and analysis methods become relevant. This latter kind of thinking is addressed in Dasgupta, Miller, and Wu (2010).

Another way in which statistical planning has the potential to improve measurement is by informing the choice of which measurements are made. This includes but goes beyond ideas of simple sample size choices. For example, statistical modeling and analysis can inform the choice of targets for touching a physical solid with the probe of a coordinate measuring machine if characterization of the shape or position of the object is desired. Statistical modeling and analysis can inform the choice of a set of sieve sizes to be used in characterizing the particle size distribution of a granular solid. And so on.

6.3 The Necessity of Statistical Engagement

Most applied statisticians see themselves as collaborators with scientists and engineers, planning data collection and executing data analysis in order to help learn "what is really going on." But data don't just magically appear. They come to scientists and engineers through measurement and with measurement error. It then only makes sense that statisticians understand and take an interest in helping mitigate the ("extra") uncertainty their scientific colleagues face as users of imperfect measurements. We hope that this article proves useful to many in joining these efforts.

7 References

Automotive Industry Action Group (2010). Measurement Systems Analysis Reference Manual, 4th Edition, Chrysler Group LLC, Ford Motor Company and General Motors Corporation.

Bingham, M., Vardeman, S. and Nordman, D. (2009). "Bayes One-Sample and One-Way Random Effects Analyses for 3-d Orientations With Application to Materials Science," Bayesian Analysis, Vol. 4, No. 3, pp. 607-630, DOI:10.1214/09-BA423.

Burr, T., Hamada, M., Cremers, T., Weaver, B., Howell, J., Croft, S. and Vardeman, S. (2011). "Measurement Error Models and Variance Estimation in the Presence of Rounding Effects," Accreditation and Quality Assurance, Vol. 16, No. 7, pp. 347-359, DOI:10.1007/s00769-011-0791-0.

Croarkin, M. Carroll (2001). "Statistics and Measurement," Journal of Research of the National Institute of Standards and Technology, Vol. 106, No. 1, pp. 279-292.

Dasgupta, T., Miller, A. and Wu, C.F.J. (2010). "Robust Design of Measurement Systems," Technometrics, Vol. 52, No. 1, pp. 80-93.

de Mast, J. and van Wieringen, W. (2010). "Modeling and Evaluating Repeatability and Reproducibility of Ordinal Classifications," Technometrics, Vol. 52, No. 1, pp. 94-99.

Gelman, Andrew (2006). "Prior Distributions for Variance Parameters in Hierarchical Models," Bayesian Analysis, Vol. 1, No. 3, pp. 515-533.

Gertsbakh, Ilya (2002). Measurement Theory for Engineers, Springer-Verlag, Berlin-Heidelberg-New York.

Gleser, Leon (1998). "Assessing Uncertainty in Measurement," Statistical Science, Vol. 13, No. 3, pp. 277-290.

Joint Committee for Guides in Metrology Working Group 1 (2008). JCGM 100:2008 (GUM 1995 with minor corrections) Evaluation of Measurement Data – Guide to the Expression of Uncertainty in Measurement, International Bureau of Weights and Measures, Sèvres.
http://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf

Lee, C.-S. and Vardeman, S. (2001). "Interval Estimation of a Normal Process Mean from Rounded Data," Journal of Quality Technology, Vol. 33, No. 3, pp. 335-348.

Lee, C.-S. and Vardeman, S. (2002). "Interval Estimation of a Normal Process Standard Deviation from Rounded Data," Communications in Statistics, Vol. 31, No. 1, pp. 13-34.

Leyva, N., Page, G., Vardeman, S. and Wendelberger, J. (2011). "Bayes Statistical Analyses for Particle Sieving Studies," LA-UR-10-06017, submitted to Technometrics.

Morris, Alan S. (2001). Measurement and Instrumentation Principles, 3rd Edition, Butterworth-Heinemann, Oxford.

National Institute of Standards and Technology (2003). NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, July 25, 2011.

Taylor, B.N. and Kuyatt, C.E. (1994). Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results – NIST Technical Note 1297, National Institute of Standards and Technology, Gaithersburg, MD.
http://www.nist.gov/pml/pubs/tn1297/index.cfm

Vardeman, Stephen B. (2005). "Sheppard's Correction for Variances and the 'Quantization Noise Model'," IEEE Transactions on Instrumentation and Measurement, Vol. 54, No. 5, pp. 2117-2119.

Vardeman, S.B. and Jobe, J.M. (1999). Statistical Quality Assurance Methods for Engineers, John Wiley, New York.

Vardeman, S.B. and Jobe, J.M. (2001). Basic Engineering Data Collection and Analysis, Duxbury/Thomson Learning, Belmont.

Vardeman, S., Wendelberger, J., Burr, T., Hamada, M., Moore, L., Morris, M., Jobe, J.M. and Wu, H. (2010). "Elementary Statistical Methods and Measurement," The American Statistician, Vol. 64, No. 1, pp. 52-58.

Weaver, B., Hamada, M., Wilson, A. and Vardeman, S. (2011). "A Bayesian Approach to the Analysis of Gauge R&R Data," LA-UR-11-00438, submitted to IIE Transactions.