ch. 4: statistics - university of...

Ch. 4: Statistics

Outline:

• 4-1 Gaussian Distribution

• 4-2 Confidence Intervals

• 4-3 Means and Studentʼs t-test

• 4-4 Standard Deviations and the F-test

• 4-5 Excel exercise: t tests with a spreadsheet

• 4-6 Grubbs Test for an outlier

• 4-7 Method of Least Squares

• 4-8 Calibration Curves

• 4-9 Excel exercise: least squares on a spreadsheet

Updated Oct. 7, 2011: minor fixes to slides 16, 36

Gaussian Distribution

If an experiment is repeated a large number of times, the resulting values tend to cluster symmetrically around an average value. Many repetitions result in a Gaussian distribution, the so-called bell curve.

In practice, you will make only 3 to 5 measurements in the lab (not thousands); nonetheless, parameters that estimate the behaviour of a much larger set of data can be obtained from these few measurements.

Mean & Standard Deviation

The arithmetic mean (or average), x̄, is defined as the sum of the measured values, xi, divided by the number of measurements, n.

The plot to the right shows the number of bulbs plotted as a function of lifetime. Though the two sets of bulbs have the same mean lifetime, it is clear that the set of bulbs with s = 47.1 h has undergone a more uniform manufacturing process than the set with s = 94.2 h.

$$\bar{x} = \frac{\sum_i x_i}{n}$$

$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$$

The standard deviation gives a measure of how closely the data is clustered about the mean (i.e., the precision), with a small s corresponding to a tight clustering.

Mean & Standard Deviation, 2

If the dataset is infinite (i.e., n = ∞), then the mean and standard deviation are designated by the Greek letters μ and σ. Of course, it is not possible to measure these values in practice; however, as the number of measurements increases, the values of x̄ and s approach μ and σ.

Other important statistical values:

The quantity n - 1 in the denominator of s is referred to as the degrees of freedom.

The square of the standard deviation, s^2, is known as the variance.

Finally, if the standard deviation is expressed as a percentage of the mean, this is called the coefficient of variation or the relative standard deviation:

$$\text{coefficient of variation} = \frac{s}{\bar{x}} \times 100\%$$

Significant figures: Experimental results are expressed in the form x̄ ± s (n = _), where n is the number of data points. For example, 823 ± 30 (n = 4) or 8.2 (±0.3) × 10^2 (n = 4) indicates that the mean has just two significant figures.
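As a minimal sketch of the definitions above, the Python below computes the mean, standard deviation, variance, and coefficient of variation with the standard library. The four data values are hypothetical, chosen only so that the result rounds to the 823 ± 30 (n = 4) quoted above; they are not taken from the slides.

```python
import statistics

# Hypothetical replicate measurements (an assumption for illustration).
data = [821, 783, 834, 855]

n = len(data)
mean = statistics.mean(data)          # x-bar = sum(x_i) / n
s = statistics.stdev(data)            # sample standard deviation (n - 1 in the denominator)
variance = statistics.variance(data)  # s squared
cv = 100 * s / mean                   # coefficient of variation, in percent

print(f"mean = {mean:.2f}, s = {s:.2f}, variance = {variance:.2f}, CV = {cv:.2f}%")
print(f"reported as {mean:.0f} ± {round(s, -1):.0f} (n = {n})")
```

Note that `statistics.stdev` uses n − 1 degrees of freedom, matching the definition of s; `statistics.pstdev` would divide by n instead.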

Probability

The Gaussian curve is described by the formula

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$$

A Gaussian curve in which μ = 0 and σ = 1. A Gaussian curve whose area is unity is called a normal error curve. The abscissa z = (x − μ)/σ is the distance away from the mean, measured in units of the standard deviation. When z = 2, we are two standard deviations away from the mean.

Here e is the base of the natural logarithm, and μ and σ are approximated by x̄ and s. The graph of this equation (shown with μ = 0 and σ = 1) has its maximum value of y at x = μ, and the curve is symmetric about this value.

Deviations from the mean are often expressed as multiples of the standard deviation, z:

$$z = \frac{x - \mu}{\sigma} \approx \frac{x - \bar{x}}{s}$$

That is, when z = +1, x is one standard deviation above the mean, and when z = −2, x is two standard deviations below the mean.

Probability, 2

The probability of measuring z in a certain range is equal to the area under the curve over that range. For example, the probability of observing z between −2 and −1 is 0.136 (this corresponds to the shaded area on the previous slide). The area under each portion of the curve is shown below:

Probability, 3

The sum of all probabilities must equal unity (1 or 100%); hence, the area under the curve from z = −∞ to z = +∞ must also equal 1. The factor 1/(σ√(2π)) is called the normalization factor, and ensures that this is the case.

The standard deviation is therefore a measure of the width of the curve; the larger the standard deviation, the broader the curve, and the less the precision.

Interestingly, in any Gaussian curve (regardless of the nature of the underlying data), 68.3% of the area lies within 1 standard deviation of the mean (i.e., between μ − 1σ and μ + 1σ), meaning that more than two thirds of the data are expected to lie in this range.
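The areas quoted in these slides can be checked numerically. The sketch below assumes the standard identity that the cumulative area under a Gaussian with μ = 0 and σ = 1 is 0.5[1 + erf(z/√2)]; the function name `gaussian_area` is made up for this illustration.

```python
import math

def gaussian_area(z_low, z_high):
    """Area under the standard Gaussian curve between z_low and z_high."""
    def cdf(z):
        # Cumulative area from -infinity to z, written with the error function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf(z_high) - cdf(z_low)

area_shaded = gaussian_area(-2, -1)               # region between z = -2 and z = -1
area_1sigma = gaussian_area(-1, 1)                # within one standard deviation
area_total = gaussian_area(-math.inf, math.inf)   # whole curve

print(f"{area_shaded:.3f} {area_1sigma:.3f} {area_total:.3f}")
```

The first two results should round to the 0.136 and 68.3% given in the text, and the total area should be 1, as the normalization factor requires.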

Standard Deviation of the Mean

To measure the mean lifetime of a collection of light bulbs, we could measure each bulbʼs lifetime and compute the average. Alternatively, we could select four bulbs at a time, measure the lifetime of each, and compute the average lifetime of the four (repeating this for many sets of four). From these data, we compute μ and σ4 (the subscript 4 indicates sets of 4).

The means that are calculated from both of these methods work out to be the same, but the standard deviations are different. In this case, σ4 = σ/√4.

Thus, σ4 is the standard deviation of the mean for sets of 4 samples. The standard deviation of the mean for sets of n samples is expressed as:

$$\sigma_n = \frac{\sigma}{\sqrt{n}}$$

The more measurements that are made, the higher the confidence that the average is close to the true mean. Uncertainty decreases in proportion to 1/√n, where n is the number of measurements (e.g., the uncertainty decreases by a factor of 10 with 100 measurements!).
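The relation between σ and the standard deviation of the mean can be illustrated with a small simulation. This is only a sketch: the population mean of 850 h is a hypothetical value, σ = 94.2 h is borrowed from the bulb example purely for scale, and the number of simulated sets is arbitrary.

```python
import random
import statistics

random.seed(1)            # fixed seed so the run is repeatable

MU = 850.0                # hypothetical population mean lifetime (h)
SIGMA = 94.2              # assumed population standard deviation (h)
N = 4                     # bulbs per set
SETS = 20000              # number of simulated sets of N bulbs

# Average lifetime of each set of N simulated bulbs.
set_means = [
    sum(random.gauss(MU, SIGMA) for _ in range(N)) / N
    for _ in range(SETS)
]

sigma_n_observed = statistics.pstdev(set_means)   # spread of the set averages
sigma_n_expected = SIGMA / N ** 0.5               # sigma / sqrt(n)

print(f"observed {sigma_n_observed:.1f} h, expected {sigma_n_expected:.1f} h")
```

The observed spread of the four-bulb averages should come out close to σ/√4, about half the spread of the individual lifetimes.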

Confidence Intervals

Studentʼs t is used to express confidence intervals and to compare results from different experiments.

“Student” was the pseudonym of W. S. Gosset, whose employer, the Guinness breweries of Ireland, restricted publications for proprietary reasons. Because of the importance of Gosset’s work, he was allowed to publish it (Biometrika 1908, 6, 1), but under an assumed name.

For a limited number of measurements, n, we determine the sample mean, x̄, and the sample standard deviation, s. The confidence interval is computed from

$$\text{Confidence interval} = \bar{x} \pm \frac{ts}{\sqrt{n}}$$

where t is Studentʼs t, which is selected from the table on the next slide for the desired level of confidence.

What does the confidence interval tell us? For example, the 95% confidence interval would include the true population mean (an unknown value) in 95% of the sets of n measurements.

Confidence Intervals

Pick a Studentʼs t value by selecting a confidence level and determining the number of degrees of freedom (e.g., if n = 21, then n − 1 = 20, and for CL = 98%, t = 2.528).

Confidence Intervals and Excel

Example: The carbohydrate content of a glycoprotein (a protein with sugars attached to it) is found to be 12.6, 11.9, 13.0, 12.7, and 12.5 wt% (g carbohydrate/100 g glycoprotein) in replicate analyses. Find the 50% and 90% confidence intervals for the carbohydrate content.
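The same calculation can be sketched in Python instead of a spreadsheet. The Studentʼs t values for 4 degrees of freedom (0.741 at 50% and 2.132 at 90%) are assumed from a standard t table and should be checked against the table in the slides; the helper name `half_width` is made up for this example.

```python
import math
import statistics

carbohydrate = [12.6, 11.9, 13.0, 12.7, 12.5]   # wt%, from the example

# Student's t for n - 1 = 4 degrees of freedom, copied from a standard
# t table (assumed values; check them against the table in the slides).
T_TABLE_4_DOF = {50: 0.741, 90: 2.132}

n = len(carbohydrate)
mean = statistics.mean(carbohydrate)
s = statistics.stdev(carbohydrate)

def half_width(confidence_percent):
    """Return t*s/sqrt(n), the +/- part of the confidence interval."""
    t = T_TABLE_4_DOF[confidence_percent]
    return t * s / math.sqrt(n)

for level in (50, 90):
    print(f"{level}% CI: {mean:.2f} ± {half_width(level):.2f} wt%")
```

Under these assumptions the intervals come out near 12.54 ± 0.13 wt% (50%) and 12.54 ± 0.38 wt% (90%): demanding more confidence widens the interval.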

Meaning of Confidence Intervals

A computer chose numbers at random from a Gaussian population with a population mean (μ) of 10 000 and a population standard deviation (σ) of 1 000.

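Trial 1 (leftmost point): 4 numbers were chosen, and x̄ = 9526 and s were calculated; the 50% confidence interval was computed with t = 0.765 (3 DOF). The experiment was repeated for 100 trials, so about 50 of the intervals (exactly 50% for an infinite number of experiments) should include the true population mean of 10 000; in this test, 45 did (the white blocks).

[Figure legend: blocks are marked according to whether the interval includes or does not include the population mean.]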

The same experiment was conducted at a confidence level of 90% (for an infinite number of experiments, 90% of the intervals should include the true population mean); 89 of the 100 intervals were found to include it.
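The computer experiment described here can be imitated with a short simulation. This is a sketch rather than a reproduction of the original run: the seed and the number of trials (10 000 rather than 100) are arbitrary choices, so the exact count will differ from the figure.

```python
import math
import random
import statistics

random.seed(2011)                 # arbitrary seed, for repeatability

MU, SIGMA = 10_000, 1_000         # population parameters from the slide
N = 4                             # numbers per trial (3 DOF)
T_50 = 0.765                      # Student's t, 50% confidence, 3 DOF (from the slide)
TRIALS = 10_000                   # more than the slide's 100, to smooth the estimate

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    s = statistics.stdev(sample)
    half_width = T_50 * s / math.sqrt(N)
    # Count the trial if the interval mean ± half_width contains MU.
    if abs(mean - MU) <= half_width:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the 50% confidence intervals include the population mean")
```

With many trials the fraction should settle close to 50%, while any single batch of 100 trials can easily give a count such as 45.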

SD, CI and Experimental Uncertainty

Suppose you measure the volume of a vessel five times and observe values of 6.375, 6.372, 6.374, 6.377, and 6.375 mL.

For 5 measurements: DOF = 4, CL = 95%, x̄ = 6.3746 mL, s = 0.0018 mL, Studentʼs t = 2.776, CI = ±0.0023 mL

For 21 measurements (reduced uncertainty): DOF = 20, CL = 95%, x̄ = 6.3746 mL, s = 0.0018 mL, Studentʼs t = 2.086, CI = ±0.0008 mL

Comparison of means

If two sets of measurements are made on the same sample, the mean from one set will generally not be equal to the mean from the other set (small random errors!).

We can use the t test to compare the mean values to decide if there is a statistically significant difference between them (i.e., do they agree within experimental error?).

The null hypothesis states that the mean values from two sets of measurements are not different. Statistics gives us a probability that the observed difference between two means arises from random measurement error. We reject the null hypothesis if there is less than a 5% chance that the observed difference arises from random variations (i.e., there is a 95% chance that our conclusion is correct, or 1 time out of 20 when we conclude that two means are not different we will be wrong).

In the field of statistics, the null hypothesis is assumed to be true. Unless you find strong evidence that it is not true, you continue to believe that it is true. In the U.S. legal system, the null hypothesis is that the accused person is innocent. It is up to the prosecution to produce compelling evidence that the accused person is not innocent; failing that, the jury must acquit the defendant.

Comparison of means, 2

Case 1: Comparing a measured result with a “known” value

A quantity is measured several times, and the mean and standard deviation are obtained. We need to compare our answer with an accepted answer. The average is not exactly the same as the accepted answer. Does our measured answer agree with the accepted answer “within experimental error”?

Case 2: Comparing replicate measurements

A quantity is measured multiple times by two different methods that give two different answers, each with its own standard deviation. Do the two results agree with each other “within experimental error”?

Case 3: Paired t test for computing individual difference

Sample A is measured once by method 1 and once by method 2; the two measurements do not give exactly the same result. Sample B is measured once by method 1 and once by method 2, and, again, the results are not exactly equal. The procedure is repeated for n different samples. Do the two methods agree with each other “within experimental error”?

Case 1: Comparing a measured result with a “known” value

You purchased a Standard Reference Material coal sample certified by the National Institute of Standards and Technology to contain 3.19 wt% sulphur. You are testing a new analytical method to see whether it can reproduce the known value. The measured values are 3.29, 3.22, 3.30, and 3.23 wt% sulphur, with x̄ = 3.260 and s = 0.041.

Does your answer agree with the known answer? To find out, compute the 95% confidence interval for your answer and see if that range includes the known answer. If the known answer is not within your 95% confidence interval, then the results do not agree.

95% confidence interval:

$$\bar{x} \pm \frac{ts}{\sqrt{n}} = 3.260 \pm \frac{(3.182)(0.041)}{\sqrt{4}} = 3.260 \pm 0.065$$

so the 95% confidence interval = 3.195 to 3.325 wt%

The known answer (3.19 wt%) is just outside the 95% confidence interval; therefore, there is less than a 5% chance that our method agrees with the known answer. Thus, we conclude that our method gives a “different” result from the known result. However, in this case, the 95% confidence interval is so close to including the known result that it would be prudent to make more measurements before concluding that our new method is not accurate.
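The Case 1 comparison can be sketched in a few lines of Python. The confidence-interval check follows the slide; the equivalent check with t_calc = |x̄ − known| √n / s is added here as an assumption (it has the same form as the t_calc used later for the red blood cell counts).

```python
import math
import statistics

measured = [3.29, 3.22, 3.30, 3.23]   # wt% sulphur, from the example
known = 3.19                          # certified value, wt% sulphur
t_table = 3.182                       # Student's t, 95% confidence, 3 DOF

n = len(measured)
mean = statistics.mean(measured)
s = statistics.stdev(measured)

half_width = t_table * s / math.sqrt(n)
ci_low, ci_high = mean - half_width, mean + half_width

# Equivalent test: compare t_calc with the tabulated t.
t_calc = abs(mean - known) / s * math.sqrt(n)
known_inside_ci = ci_low <= known <= ci_high

print(f"95% CI: {ci_low:.3f} to {ci_high:.3f} wt%; t_calc = {t_calc:.2f}")
```

Both checks give the same verdict: the known value falls outside the interval exactly when t_calc exceeds the tabulated t.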

Case 2: Comparing replicate measurements

Lord Rayleigh (John W. Strutt) discovered the element argon (Nobel Prize in 1904) at a time when dry air was thought to be 1/5 O2 and 4/5 N2.

1. He took air, removed the oxygen with hot copper (to make CuO), collected the remaining gas, and accurately measured its weight and density.

2. He compared this to pure N2 gas obtained from the chemical decomposition of nitrous oxide, nitric oxide and ammonium nitrite.

The average mass collected from air (2.31011 g) was 0.46% greater than the average mass of the same volume of gas from chemical sources (2.29947 g).

Experiments were done with great care and repeated many times - Rayleigh understood that the discrepancy was outside his margin of error, and he postulated that gas collected from the air was a mixture of N2 with a small amount of a heavier gas, which turned out to be Ar.

Case 2, 2

We will use the t test to see if the gas isolated from air is “significantly” different from nitrogen isolated from chemical sources. There are two sets of measurements and no “known” value for comparison, and it is assumed that the values of σ for the two sets are similar.

Using two sets of data with n1 and n2 measurements, t is calculated with:

$$t_{\text{calc}} = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{\text{pooled}}} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

where spooled is a pooled standard deviation that makes use of both sets of data.

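$$s_{\text{pooled}} = \sqrt{\frac{\sum_{\text{set 1}} (x_i - \bar{x}_1)^2 + \sum_{\text{set 2}} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}} = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$$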

The tcalc is compared to the t from the table for n1 + n2 - 2 degrees of freedom. If tcalc is greater than ttable at CL = 95%, then the results are considered to be different (i.e., less than a 5% chance that the two sets of data have the same population mean).

N.B., There are also sets of equations for cases where the values of σ from each set are not similar to one another.
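A minimal sketch of the pooled calculation is given below. The two data sets are hypothetical numbers invented for illustration (they are not Rayleighʼs measurements), and the function name `pooled_t` is made up; the sketch assumes the two standard deviations are similar, as Case 2 requires.

```python
import math
import statistics

def pooled_t(set1, set2):
    """Return (t_calc, s_pooled, degrees_of_freedom) for two data sets."""
    n1, n2 = len(set1), len(set2)
    s1, s2 = statistics.stdev(set1), statistics.stdev(set2)
    dof = n1 + n2 - 2
    s_pooled = math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / dof)
    diff = abs(statistics.mean(set1) - statistics.mean(set2))
    t_calc = diff / s_pooled * math.sqrt(n1 * n2 / (n1 + n2))
    return t_calc, s_pooled, dof

# Hypothetical data, invented for illustration (not Rayleigh's measurements).
method_a = [10.0, 10.2, 10.4]
method_b = [10.6, 10.8, 11.0]

t_calc, s_pooled, dof = pooled_t(method_a, method_b)
t_table = 2.776   # Student's t for 95% confidence and 4 DOF
significantly_different = t_calc > t_table

print(f"t_calc = {t_calc:.3f}, s_pooled = {s_pooled:.3f}, DOF = {dof}")
```

For these invented numbers t_calc exceeds the tabulated value, so the two means would be judged different at the 95% confidence level.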

Case 3: Paired t test for computing individual difference

Two separate methods are used to measure the same quantity in multiple samples. For this example, we look at aluminum concentrations in drinking water. The results are similar, but not identical. Hence, to see if the difference is significant (i.e., outside of experimental error), the paired t test is used. To get started, the differences between the methods are tabulated, and the mean and S.D. of these differences are determined. Then,

$$t_{\text{calc}} = \frac{|\bar{d}|}{s_d} \sqrt{n} = \frac{2.491}{6.748} \sqrt{11} = 1.224$$

Note that the absolute value of the mean of the differences is in the numerator, so tcalc > 0.

Since tcalc = 1.224 is less than ttable = 2.228 (for CL = 95% and 10 DOF), there is more than a 5% chance that the two sets of data lie “within experimental error” of one another, meaning that the results are not significantly different (the different methods work well!).
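The paired calculation can be sketched as follows. The individual aluminum results are not reproduced in the text, so the paired lists below are hypothetical values invented to show the mechanics; the last lines simply recompute tcalc from the summary values quoted in the slide.

```python
import math
import statistics

def paired_t(method1, method2):
    """t_calc for paired results: |mean difference| / s_d * sqrt(n)."""
    differences = [a - b for a, b in zip(method1, method2)]
    d_bar = statistics.mean(differences)
    s_d = statistics.stdev(differences)
    return abs(d_bar) / s_d * math.sqrt(len(differences))

# Hypothetical paired results for four samples (invented for illustration).
m1 = [17.2, 23.1, 28.5, 15.3]
m2 = [14.2, 27.9, 24.9, 16.0]
t_example = paired_t(m1, m2)

# Check against the summary values quoted in the slide.
d_bar, s_d, n = 2.491, 6.748, 11
t_slide = abs(d_bar) / s_d * math.sqrt(n)

print(f"t (hypothetical data) = {t_example:.3f}, t (slide summary) = {t_slide:.3f}")
```

Because the differences are taken sample by sample, sample-to-sample variation in concentration cancels out and only the disagreement between the methods is tested.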

Significance: 1- and 2-Tails

The curve below (a) is the t distribution for 3 DOF. If the certified value lies in the outer 5% of the area under the curve, we reject the null hypothesis and conclude with 95% confidence that the measured mean is not equivalent to the certified value.

The critical value of t for rejecting the null hypothesis is 3.182 for 3 degrees of freedom (see Table 4-2, slide 10). In (a), 2.5% of the area beneath the curve lies above t = +3.182 and 2.5% of the area lies below t = −3.182. We call this a two-tailed test because we reject the null hypothesis if the certified value lies in the low-probability region on either side of the mean.

We will discuss 2-tailed t-tests for the most part.

Red Blood Cell Counts

Is todayʼs blood cell count anomalously high? Or, given the set of data, is the count regularly expected to get up to 5.6 × 10^6 cells/μL?

$$t_{\text{calc}} = \frac{|\text{today's count} - \bar{x}|}{s} \sqrt{n} = \frac{|5.6 - 5.16|}{0.23} \sqrt{5} = 4.28$$

In Table 4-2 (slide 10), looking across the row for 4 degrees of freedom, we see that 4.28 lies between the 98% (t = 3.747) and 99% (t = 4.604) confidence levels. Todayʼs red cell count lies in the upper tail of the curve containing less than 2% of the area of the curve. There is less than a 2% probability of observing a count of 5.6 × 10^6 cells/μL on “normal” days. It is reasonable to conclude that todayʼs count is elevated.

F test and Standard Deviations

The F test tells us whether two standard deviations are “significantly” different from each other. F is the quotient of the squares of the standard deviations:

$$F_{\text{calc}} = \frac{s_1^2}{s_2^2}$$

The larger standard deviation is placed in the numerator so that F ≥ 1. The hypothesis that s1 > s2 is tested using the one-tailed F test in the table on the next slide. If Fcalculated > Ftable, then the difference is significant, and different formulae (from those on slide 18) must be used to compare the means.
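The ratio itself is a one-line calculation. In the sketch below the two standard deviations are hypothetical values invented for illustration, and the function name `f_calc` is made up; the critical value Ftable still has to be read from the table for the appropriate degrees of freedom.

```python
def f_calc(s_a, s_b):
    """Quotient of the variances, with the larger one in the numerator."""
    larger, smaller = max(s_a, s_b), min(s_a, s_b)
    return larger**2 / smaller**2

# Hypothetical standard deviations (invented for illustration).
F = f_calc(0.20, 0.47)

print(f"F_calc = {F:.2f}")
```

Sorting the two values inside the function guarantees F ≥ 1 regardless of the order in which they are supplied.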

F test and Standard Deviations, 2

Grubbs Test for Outliers

Students dissolved zinc from a galvanized nail and measured the mass lost by the nail to tell how much of the nail was zinc. Here are 12 results:

Mass loss (%): 10.2, 10.8, 11.6, 9.9, 9.4, 7.8, 10.0, 9.2, 11.3, 9.5, 10.6, 11.6

Question: Should we discard or retain the apparent outlier, 7.8?

1. Compute the average and standard deviation: x̄ = 10.16, s = 1.11.

2. Calculate the Grubbs statistic:

$$G_{\text{calculated}} = \frac{|\text{questionable value} - \bar{x}|}{s}$$

3. If Gcalculated > Gtable, then discard the point.

Above, Gcalculated = 2.13 and Gtable (N = 12) = 2.285, so the point should be retained: there is more than a 5% chance that the value is a member of the same population as the other values. But... use common sense!
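The three steps above can be sketched in Python with the 12 results from the slide. Picking the questionable value automatically, as the point farthest from the mean, is an assumption of this sketch. Carrying full precision gives G of about 2.12 rather than the 2.13 obtained from the rounded mean and standard deviation; the conclusion is the same.

```python
import statistics

mass_loss = [10.2, 10.8, 11.6, 9.9, 9.4, 7.8, 10.0, 9.2, 11.3, 9.5, 10.6, 11.6]
G_TABLE_N12 = 2.285          # critical value quoted in the slide for N = 12

mean = statistics.mean(mass_loss)
s = statistics.stdev(mass_loss)

# The questionable value is taken to be the one farthest from the mean.
questionable = max(mass_loss, key=lambda x: abs(x - mean))
g_calc = abs(questionable - mean) / s

discard = g_calc > G_TABLE_N12

print(f"mean = {mean:.2f}, s = {s:.2f}, G = {g_calc:.2f}, discard = {discard}")
```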

Method of Least Squares

For most chemical analyses, the response of the procedure must be evaluated for known quantities of analyte (called standards) so that the response to an unknown quantity can be interpreted. We prepare a calibration curve, which ideally is linear in the region of interest.

The method of least squares is used to predict the “best” straight line through a dataset, though some points of course will scatter from the line.

In this section, we learn to estimate the uncertainty in a chemical analysis from the uncertainties in the calibration curve and in the measured response to replicate samples of unknown.

Calibration curves for analysis of caffeine and theobromine content (see lecture 0). The black points are from the standards, and the blue points are from the unknown.

Method of Least Squares, 2

To use this procedure, we assume:

1. The uncertainties in y are greater than those of x, which is typically the case in analytical chemistry (i.e., response of an instrument vs. weight/volume).

2. The uncertainties (std. dev.) in all of the y values are similar.

The Gaussian curve drawn over the point (3,3) is a schematic indication of the fact that each value of yi is normally distributed about the straight line. That is, the most probable value of y will fall on the line, but there is a finite probability of measuring y some distance from the line.

Method of Least Squares, 3

The equation of a straight line is:

y = mx + b

The vertical deviation for the point (xi, yi) is yi − y, where y is the ordinate of the line when x = xi.

$$\text{vertical deviation} = d_i = y_i - y = y_i - (m x_i + b)$$

Since the deviations about the line have equal chances of being positive or negative, we wish to minimize the magnitude of deviations (irrespective of sign), so we take the squares (all +ve numbers):

$$d_i^2 = (y_i - y)^2 = (y_i - m x_i - b)^2$$

Hence, since we are minimizing the squares of the deviations, this technique is known as the method of least squares.

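Method of Least Squares, 4

We omit the calculus used to derive this method, and express the final solutions in terms of determinants, which are defined as:

$$\begin{vmatrix} e & f \\ g & h \end{vmatrix} = eh - fg; \quad \text{e.g., } \begin{vmatrix} 6 & 5 \\ 4 & 3 \end{vmatrix} = (6 \times 3) - (5 \times 4) = -2$$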

The slope and the intercept of the “best” straight line for a least squares fit are:

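$$m = \begin{vmatrix} \sum (x_i y_i) & \sum x_i \\ \sum y_i & n \end{vmatrix} \div D \quad \text{and} \quad b = \begin{vmatrix} \sum (x_i^2) & \sum (x_i y_i) \\ \sum x_i & \sum y_i \end{vmatrix} \div D$$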

where n is the number of points and D is

$$D = \begin{vmatrix} \sum (x_i^2) & \sum x_i \\ \sum x_i & n \end{vmatrix}$$

From the manual calculation below, y = 0.61538x + 1.34615

Method of Least Squares, 5

The population standard deviation of all y values, σy, characterizes the little Gaussian curve on the linear regression plot on slide 26, and the uncertainties in m and b are related to it. We estimate it by calculating the standard deviation, sy, for all measured values of y, where the deviation of each yi from the centre of the Gaussian curve is di = yi − y = yi − (mxi + b):

$$\sigma_y \approx s_y = \sqrt{\frac{\sum (d_i - \bar{d})^2}{\text{(degrees of freedom)}}}$$

where d̄ is the average deviation, which is equal to 0 for the least-squares straight line. This means that the numerator above can simply be expressed as Σ(di^2). The degrees of freedom are the number of independent pieces of information available (e.g., n − 1 once the average has been calculated from the data, or n − 2 once both the slope and intercept have been).

The uncertainty associated with the y values, and the standard deviations of m and b are:

$$s_y = \sqrt{\frac{\sum (d_i^2)}{n - 2}}; \quad s_m^2 = \frac{s_y^2\, n}{D}; \quad s_b^2 = \frac{s_y^2 \sum (x_i^2)}{D}$$

The first significant digit of the standard deviation marks the last significant figure of the slope or intercept.
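The determinant formulas for m, b, and D, and the standard deviations sy, sm, and sb, can be sketched in Python as below. The four data points are an assumption: the text does not list them, and these were chosen because they include the point (3, 3) and reproduce the quoted line y = 0.61538x + 1.34615. The helper name `det2` is made up for this example.

```python
import math

# Assumed data: the slides do not list the points in the text. These four
# were chosen because they include (3, 3) and reproduce the quoted line
# y = 0.61538x + 1.34615.
points = [(1, 2), (3, 3), (4, 4), (6, 5)]

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)

def det2(e, f, g, h):
    """2 x 2 determinant | e f ; g h | = eh - fg."""
    return e * h - f * g

D = det2(sum_x2, sum_x, sum_x, n)
m = det2(sum_xy, sum_x, sum_y, n) / D        # slope
b = det2(sum_x2, sum_xy, sum_x, sum_y) / D   # intercept

# Vertical deviations d_i = y_i - (m*x_i + b), then the standard deviations.
d = [y - (m * x + b) for x, y in points]
s_y = math.sqrt(sum(di**2 for di in d) / (n - 2))
s_m = math.sqrt(s_y**2 * n / D)
s_b = math.sqrt(s_y**2 * sum_x2 / D)

print(f"y = {m:.5f}x + {b:.5f}; s_y = {s_y:.4f}, s_m = {s_m:.4f}, s_b = {s_b:.4f}")
```

For these assumed points the deviations sum to zero, which is why the numerator of sy reduces to the sum of the squared deviations.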

Method of Least Squares, 6

As you are probably already aware, it is trivial to perform least squares analyses using the Microsoft Excel program. For a simple slope and intercept, the SLOPE and INTERCEPT functions are used. For the standard deviations of the slope and intercept, the LINEST function is used. LINEST is an array formula (see instructions below), and is input as LINEST(y values, x values, TRUE, TRUE), where the values are expressed the usual way (as a range of cells, e.g., B2:B5, etc.).

Calibration Curves

Consider the set of data below, from spectrophotometric measurements of absorbance of light by aqueous protein samples (absorbance is proportional to protein concentration).

This data involves known concentrations of analyte (as well as a blank), and can be used to construct a calibration curve. Note the outlier marked in parentheses. Based on the curve on the right, this point (0.392) can be omitted from further calculations.

In addition, the average absorbance for the 25.0 μg sample seems a bit low; however, repetition of this analysis shows that this always happens, and the original data is good.

The corrected absorbance is the measured absorbance minus the absorbance of the blank sample.

Calibration Curves, 2

Constructing a calibration curve:

Step 1: Prepare known samples of analyte covering a range of concentrations expected for unknowns. Measure the response of the analytical procedure to these standards.

Step 2: Subtract the average absorbance of the blank samples (0.0993) from each measured absorbance to obtain corrected absorbance.

Step 3: Make a graph of corrected absorbance versus quantity of protein analyzed and use the least-squares procedure to find the best straight line through the linear portion of the data (up to and including 20.0 μg of protein).

The equation of the solid straight line fitting the 14 data points (open circles) from 0 to 20 μg, derived by the method of least squares, is y = 0.01630 (±0.00022)x + 0.0047 (±0.0026) with sy = 0.0059. The equation of the dashed quadratic curve that fits all 17 data points from 0 to 25 μg, determined by a nonlinear least squares procedure, is y = −1.17 (±0.21) × 10^−4 x^2 + 0.01858 (±0.00046)x − 0.0007 (±0.0010) with sy = 0.0046.

y(±sy ) = [m(±sm )]x + [b(±sb )]
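Once the line is known, an unknown is read off by inverting y = mx + b. The sketch below uses the slope, intercept, blank average, and linear range quoted above; the raw absorbance of 0.4013 is a hypothetical reading invented for illustration, and the function name `protein_from_absorbance` is made up.

```python
# Linear calibration quoted above: y = 0.01630x + 0.0047,
# valid from 0 to 20.0 micrograms of protein.
SLOPE = 0.01630
INTERCEPT = 0.0047
BLANK_MEAN = 0.0993           # average absorbance of the blanks
LINEAR_RANGE_UG = (0.0, 20.0)

def protein_from_absorbance(raw_absorbance):
    """Return micrograms of protein for one raw absorbance reading."""
    corrected = raw_absorbance - BLANK_MEAN        # Step 2: blank correction
    amount = (corrected - INTERCEPT) / SLOPE       # invert y = mx + b
    low, high = LINEAR_RANGE_UG
    if not low <= amount <= high:
        raise ValueError("result lies outside the calibrated linear range")
    return amount

# Hypothetical unknown with a raw absorbance of 0.4013 (invented value).
unknown_ug = protein_from_absorbance(0.4013)

print(f"{unknown_ug:.2f} micrograms of protein")
```

Raising an error outside 0 to 20 μg is a design choice of this sketch: the straight line was fitted only to that portion of the data, so results beyond it would rely on extrapolation.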

Calibration Curves, 3

Calibration Curves, 4

The linear range of an analytical method is the range over which the response to analyte concentration is linear, whereas the dynamic range simply describes the region over which there is a response, even if it is not linear. In this latter case, one may use non-linear regression methods (see text) to fit curves that deviate from linearity.

Finally, the uncertainty in x (sx) can be calculated from this unwieldy equation:

$$s_x = \frac{s_y}{|m|} \sqrt{\frac{1}{k} + \frac{1}{n} + \frac{(y - \bar{y})^2}{m^2 \sum (x_i - \bar{x})^2}}$$

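where |m| is the absolute value of the slope, k is the number of replicate measurements of the unknown, and n is the number of data points for the calibration line. Here y is the mean of the k measured responses for the unknown, ȳ is the mean of the y values of the calibration points, xi are the x values of the calibration points, and x̄ is their mean.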
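The equation for sx is easier to follow with numbers. The sketch below makes several assumptions: the four calibration points are the same assumed set that reproduces y = 0.61538x + 1.34615, the unknown is taken to give a response of y = 2.72 from a single measurement (k = 1), and the slope is computed from the equivalent form m = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)^2 rather than from determinants. Under these assumptions the result matches the x = 2.2325 and sx = 0.3735 quoted on the Excel slide at the end of the deck.

```python
import math

# Assumed calibration points (not listed in the text); they reproduce the
# line y = 0.61538x + 1.34615 quoted earlier in the deck.
points = [(1, 2), (3, 3), (4, 4), (6, 5)]
y_unknown = 2.72   # assumed mean response of the unknown
k = 1              # assumed number of replicate measurements of the unknown

n = len(points)
x_mean = sum(x for x, _ in points) / n
y_mean = sum(y for _, y in points) / n
sxx = sum((x - x_mean) ** 2 for x, _ in points)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)

m = sxy / sxx                      # least-squares slope
b = y_mean - m * x_mean            # least-squares intercept
s_y = math.sqrt(sum((y - (m * x + b)) ** 2 for x, y in points) / (n - 2))

x_unknown = (y_unknown - b) / m
s_x = (s_y / abs(m)) * math.sqrt(
    1 / k + 1 / n + (y_unknown - y_mean) ** 2 / (m**2 * sxx)
)

t_95_2dof = 4.303                  # Student's t, 95% confidence, n - 2 = 2 DOF
print(f"x = {x_unknown:.4f} ± {t_95_2dof * s_x:.4f} (s_x = {s_x:.4f})")
```

Increasing k shrinks the 1/k term, which is why replicate measurements of the unknown are recommended.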

Good practice

1. Always make a graph of your data. The graph gives you an opportunity to reject bad data or the stimulus to repeat a measurement or decide that a straight line is not an appropriate function.

2. Do not extrapolate any calibration curve, linear or nonlinear, beyond the measured range of standards.

3. At least six calibration concentrations and two replicate measurements of unknown are recommended. The most rigorous procedure is to make each calibration solution independently from a certified material.

4. Avoid serial dilution of a single stock solution. Serial dilution propagates any systematic error in the stock solution.

5. Measure calibration solutions in random order, not in consecutive order by increasing concentration.

Least squares and Excel

95% confidence interval for x: x ± t·sx = 2.2325 ± (4.303)(0.3735) = 2.2 ± 1.6 (DOF: 2)
