Statistics for Analytical Chemistry

The General Analytical Problem
Select sample
Extract analyte(s) from matrix
Separate analytes
Detect, identify and quantify analytes
Determine reliability and significance of results

Errors in Chemical Analysis
It is impossible to eliminate errors.
How reliable are our data?
Data of unknown quality are useless!
Carry out replicate measurements
Analyse accurately known standards
Perform statistical tests on data

Mean
Defined as follows:
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
where x_i = individual values of x and N = number of replicate measurements.
Median
The middle result when the data are arranged in order of size (for an even number of results, the mean of the middle two). The median can be preferred when there is an outlier, i.e. one reading very different from the rest, because the median is less affected by an outlier than the mean is.

Illustration of Mean and Median
Results of 6 determinations of the Fe(III) content of a solution known to contain 20 ppm.
Note: the mean value is 19.78 ppm (i.e. 19.8 ppm); the median value is 19.7 ppm.
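A minimal Python sketch of the two definitions. The six values below are an assumption: the slide's own data are not reproduced in the text, so illustrative values were chosen that give the quoted mean and median.

```python
from statistics import mean, median

# Assumed Fe(III) results in ppm (illustrative only; not taken from the slide).
fe_ppm = [19.4, 19.5, 19.6, 19.8, 20.1, 20.3]

fe_mean = mean(fe_ppm)      # sum of the values divided by N
fe_median = median(fe_ppm)  # mean of the two middle values, since N is even

print(f"mean = {fe_mean:.2f} ppm, median = {fe_median:.1f} ppm")
```

Replacing the largest value with a gross outlier (say 25.0) shifts the mean noticeably but leaves the median unchanged, which is why the median can be preferred when an outlier is present.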


Precision
Relates to the reproducibility of results.
How similar are values obtained in exactly the same way?
Useful for measuring this:
Deviation from the mean:
$$d_i = |x_i - \bar{x}|$$

Accuracy
Measurement of agreement between the experimental mean and the true value (which may not be known!).
Measures of accuracy:
Absolute error: E = x_i - x_t (where x_t = true or accepted value)
Relative error:
$$E_r = \frac{x_i - x_t}{x_t} \times 100\%$$
(the latter is more useful in practice)
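As a rough sketch, the two error measures can be written as small Python functions. The function names are hypothetical, and the example applies them to the mean Fe(III) result quoted earlier (19.78 ppm against a known 20 ppm) rather than to an individual reading.

```python
def absolute_error(x, x_true):
    """E = x - x_t, in the units of the measurement."""
    return x - x_true


def relative_error_percent(x, x_true):
    """E_r = (x - x_t) / x_t, expressed as a percentage."""
    return 100.0 * (x - x_true) / x_true


# Mean Fe(III) result (19.78 ppm) compared with the known value (20 ppm).
E = absolute_error(19.78, 20.0)
E_r = relative_error_percent(19.78, 20.0)
print(f"E = {E:.2f} ppm, E_r = {E_r:.1f} %")
```

The negative sign shows that the result is low; the relative error is unit-free, so it can be compared across analyses of different magnitude.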

Illustrating the difference between accuracy and precision
Low accuracy, low precision
Low accuracy, high precision
High accuracy, low precision
High accuracy, high precision

Some analytical data illustrating accuracy and precision
Samples: benzyl isothiourea hydrochloride and nicotinic acid
Analyst 1: precise, accurate
Analyst 2: imprecise, accurate
Analyst 3: precise, inaccurate
Analyst 4: imprecise, inaccurate


Types of Error in Experimental Data
Three types:
(1) Random (indeterminate) error
Data scattered approximately symmetrically about a mean value. Affects precision - dealt with statistically (see later).
(2) Systematic (determinate) error
Several possible sources (see later). Readings all too high or too low. Affects accuracy.
(3) Gross errors
Usually obvious - give outlier readings. Detectable by carrying out sufficient replicate measurements.

Sources of Systematic Error
1. Instrument error
Need frequent calibration, both for apparatus such as volumetric flasks, burettes etc., and for electronic devices such as spectrometers.
2. Method error
Due to inadequacies in the physical or chemical behaviour of reagents or reactions (e.g. slow or incomplete reactions).
Example from earlier overhead: nicotinic acid does not react completely under normal Kjeldahl conditions for nitrogen determination.
3. Personal error
e.g. insensitivity to colour changes; tendency to estimate scale readings to improve precision; preconceived idea of the true value.

Systematic errors can be:
constant (e.g. error in burette reading - less important for larger values of the reading), or
proportional (e.g. presence of a given proportion of interfering impurity in the sample - equally significant for all values of the measurement).
Minimise instrument errors by careful recalibration and good maintenance of equipment.
Minimise personal errors by care and self-discipline.
Method errors are the most difficult. The true value may not be known.
Three approaches to minimise method errors:
analysis of certified standards
use of 2 or more independent methods
analysis of blanks

Statistical Treatment of Random Errors
There are always a large number of small, random errors in making any measurement.
These can be small changes in temperature or pressure, random responses of electronic detectors (noise), etc.
Suppose there are 4 small random errors possible.
Assume all are equally likely, and that each causes an error of ±U in the reading.
Possible combinations of errors are shown on the next slide:

Combination of Random Errors
Combinations of errors                      Total error   No.   Relative frequency
+U+U+U+U                                    +4U           1     1/16 = 0.0625
-U+U+U+U  +U-U+U+U  +U+U-U+U  +U+U+U-U      +2U           4     4/16 = 0.250
-U-U+U+U  -U+U-U+U  -U+U+U-U                0             6     6/16 = 0.375
+U-U-U+U  +U-U+U-U  +U+U-U-U
+U-U-U-U  -U+U-U-U  -U-U+U-U  -U-U-U+U      -2U           4     4/16 = 0.250
-U-U-U-U                                    -4U           1     1/16 = 0.0625
The next overhead shows this in graphical form.
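The relative frequencies can be checked by brute-force enumeration. The sketch below (variable names are illustrative) lists every combination of four errors of +U or -U and counts how often each total occurs.

```python
from collections import Counter
from itertools import product

N_ERRORS = 4  # number of small random errors, each +U or -U

# Every combination of signs, e.g. (+1, -1, +1, +1); totals are in units of U.
totals = Counter(sum(signs) for signs in product((+1, -1), repeat=N_ERRORS))
n_combinations = 2 ** N_ERRORS

for total in sorted(totals, reverse=True):
    count = totals[total]
    freq = count / n_combinations
    print(f"{total:+d}U  {count}  {count}/{n_combinations} = {freq:.4f}")
```

Increasing N_ERRORS makes the distribution of totals approach the bell shape of the Gaussian curve.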

Frequency Distribution for Measurements Containing Random Errors
[figure panels: 4 random uncertainties; 10 random uncertainties; a very large number of random uncertainties]
For a very large number of random uncertainties the result is a Gaussian or normal error curve, symmetrical about the mean.


Replicate Data on the Calibration of a 10 ml Pipette
No.   Vol, ml     No.   Vol, ml     No.   Vol, ml
1     9.988       18    9.975       35    9.976
2     9.973       19    9.980       36    9.990
3     9.986       20    9.994       37    9.988
4     9.980       21    9.992       38    9.971
5     9.975       22    9.984       39    9.986
6     9.982       23    9.981       40    9.978
7     9.986       24    9.987       41    9.986
8     9.982       25    9.978       42    9.982
9     9.981       26    9.983       43    9.977
10    9.990       27    9.982       44    9.977
11    9.980       28    9.991       45    9.986
12    9.989       29    9.981       46    9.978
13    9.978       30    9.969       47    9.983
14    9.971       31    9.985       48    9.980
15    9.982       32    9.977       49    9.983
16    9.983       33    9.976       50    9.979
17    9.988       34    9.983
Mean volume          9.982 ml
Median volume        9.982 ml
Spread               0.025 ml
Standard deviation   0.0056 ml
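The summary statistics for the pipette data can be reproduced with Python's standard library. This is a sketch only; it uses the sample standard deviation (N - 1 in the denominator), and the variable names are illustrative.

```python
from statistics import mean, median, stdev

# The 50 replicate volumes (ml), in order of measurement number.
volumes_ml = [
    9.988, 9.973, 9.986, 9.980, 9.975, 9.982, 9.986, 9.982, 9.981, 9.990,
    9.980, 9.989, 9.978, 9.971, 9.982, 9.983, 9.988, 9.975, 9.980, 9.994,
    9.992, 9.984, 9.981, 9.987, 9.978, 9.983, 9.982, 9.991, 9.981, 9.969,
    9.985, 9.977, 9.976, 9.983, 9.976, 9.990, 9.988, 9.971, 9.986, 9.978,
    9.986, 9.982, 9.977, 9.977, 9.986, 9.978, 9.983, 9.980, 9.983, 9.979,
]

mean_volume = mean(volumes_ml)
median_volume = median(volumes_ml)
spread = max(volumes_ml) - min(volumes_ml)  # largest minus smallest value
std_dev = stdev(volumes_ml)                 # sample standard deviation

print(f"mean   = {mean_volume:.3f} ml")
print(f"median = {median_volume:.3f} ml")
print(f"spread = {spread:.3f} ml")
print(f"s      = {std_dev:.4f} ml")
```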

Calibration data in graphical form
A = histogram of experimental results
B = Gaussian curve with the same mean value, the same precision (see later) and the same area under the curve as the histogram.

SAMPLE = finite number of observations
POPULATION = total (infinite) number of observations
Properties of the Gaussian curve are defined in terms of the population. We then see where modifications are needed for small samples of data.
Main properties of the Gaussian curve:
Population mean (μ): defined as earlier, with N → ∞. In the absence of systematic error, μ is the true value (the maximum on the Gaussian curve).
Remember, the sample mean (x̄) is defined for small values of N.
(Sample mean ≈ population mean when N is about 20 or more.)
Population standard deviation (σ): defined on the next overhead.

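σ: measure of the precision of a population of data, given by:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
where μ = population mean and N is very large.
The equation for a Gaussian curve is defined in terms of μ and σ, as follows:
$$y = \frac{e^{-(x - \mu)^2 / 2\sigma^2}}{\sigma \sqrt{2\pi}}$$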

Two Gaussian curves with two different standard deviations, σ_A and σ_B (= 2σ_A)
General Gaussian curve plotted in units of z, where z = (x - μ)/σ, i.e. the deviation of a datum from the mean in units of standard deviation. The plot can be used for data with any given value of the mean and any standard deviation.


Area under a Gaussian Curve
From the equation above, and as illustrated by the previous curves, 68.3% of the data lie within ±1σ of the mean (μ), i.e. 68.3% of the area under the curve lies between μ - 1σ and μ + 1σ.
Similarly, 95.5% of the area lies within ±2σ of μ, and 99.7% within ±3σ.
There are 68.3 chances in 100 that, for a single datum, the random error in the measurement will not exceed ±1σ.
The chances are 95.5 in 100 that the error will not exceed ±2σ.
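These areas can be checked numerically. For a Gaussian, the fraction of the area within ±k standard deviations of the mean equals erf(k/√2); the sketch below uses the error function from Python's math module (the function name area_within is illustrative).

```python
from math import erf, sqrt


def area_within(k):
    """Fraction of a Gaussian's area lying within +/- k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {100 * area_within(k):.2f} %")
```

The value for ±2σ is 95.45%, which the overhead quotes as 95.5%.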

Sample Standard Deviation, s
The equation for σ must be modified for small samples of data, i.e. small N:
$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$
Two differences compared with the equation for σ:
1. Use the sample mean instead of the population mean.
2. Use the degrees of freedom, N - 1, instead of N.
The reason is that, in working out the mean, the sum of the differences from the mean must be zero. If N - 1 values are known, the last value is defined. Thus there are only N - 1 degrees of freedom. For the large values of N used in calculating σ, N and N - 1 are effectively equal.

Alternative Expression for s (suitable for calculators)
$$s = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N - 1}}$$
Note: NEVER round off figures before the end of the calculation.

Standard Deviation of a Sample
Reproducibility of a method for determining the % of selenium in foods. 9 measurements were made on a single batch of brown rice.
Sample   Selenium content (mg/g), x_i   x_i^2
1        0.07                           0.0049
2        0.07                           0.0049
3        0.08                           0.0064
4        0.07                           0.0049
5        0.07                           0.0049
6        0.08                           0.0064
7        0.08                           0.0064
8        0.09                           0.0081
9        0.08                           0.0064
Σx_i = 0.69, Σx_i^2 = 0.0533
Mean = Σx_i/N = 0.077 mg/g
(Σx_i)^2/N = 0.4761/9 = 0.0529
Standard deviation:
$$s = \sqrt{\frac{0.0533 - 0.0529}{9 - 1}} = 0.007\ \text{mg/g}$$
Coefficient of variation = 9.2%
Concentration = 0.077 ± 0.007 mg/g
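The selenium calculation can be sketched in Python using the calculator form of the standard deviation, s = sqrt((Σx_i^2 - (Σx_i)^2/N)/(N - 1)), with no rounding until the final printout. Variable names are illustrative.

```python
from math import sqrt

# Selenium content (mg/g) of the 9 replicate measurements on brown rice.
selenium = [0.07, 0.07, 0.08, 0.07, 0.07, 0.08, 0.08, 0.09, 0.08]

n = len(selenium)
sum_x = sum(selenium)                   # sum of x_i
sum_x2 = sum(x * x for x in selenium)   # sum of x_i squared
mean_x = sum_x / n

# Calculator form of the sample standard deviation (N - 1 degrees of freedom).
s = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
cv_percent = 100 * s / mean_x           # relative standard deviation, %

print(f"sum x = {sum_x:.2f}, sum x^2 = {sum_x2:.4f}")
print(f"mean = {mean_x:.3f} mg/g, s = {s:.3f} mg/g, CV = {cv_percent:.1f} %")
```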

Standard Error of a Mean
The standard deviation relates to the probable error in a single measurement.
If we take a series of N measurements, the probable error of the mean is less than the probable error of any one measurement.
The standard error of the mean, s_m, is defined as follows:
$$s_m = \frac{s}{\sqrt{N}}$$

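Pooled Data
To achieve a value of s which is a good approximation to σ, i.e. N of about 20 or more, it is sometimes necessary to pool data from a number of sets of measurements (all taken in the same way).
Suppose that there are t small sets of data, comprising N_1, N_2, ..., N_t measurements.
The equation for the resultant sample standard deviation is:
$$s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{N_2} (x_j - \bar{x}_2)^2 + \cdots + \sum_{k=1}^{N_t} (x_k - \bar{x}_t)^2}{N_1 + N_2 + \cdots + N_t - t}}$$
(Note: one degree of freedom is lost for each set of data.)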

Pooled Standard Deviation
Analysis of 6 bottles of wine for residual sugar.
Bottle   Sugar % (w/v)   No. of obs.   Deviations from mean
1        0.94            3             0.05, 0.10, 0.08
2        1.08            4             0.06, 0.05, 0.09, 0.06
3        1.20            5             0.05, 0.12, 0.07, 0.00, 0.08
4        0.67            4             0.05, 0.10, 0.06, 0.09
5        0.83            3             0.07, 0.09, 0.10
6        0.76            4             0.06, 0.12, 0.04, 0.03
Set      Sum of squared deviations   s
1        0.0189                      0.097
2        0.0178                      0.077
3        0.0282                      0.084
4        0.0242                      0.090
5        0.0230                      0.107
6        0.0205                      0.083
Total    0.1326
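A short Python sketch of the pooled calculation for the wine data: the squared deviations from all six bottles are summed and divided by the total number of observations minus the number of sets. The dictionary layout and names are illustrative.

```python
from math import sqrt

# Deviations from each bottle's own mean (sugar, % w/v), keyed by bottle number.
deviations = {
    1: [0.05, 0.10, 0.08],
    2: [0.06, 0.05, 0.09, 0.06],
    3: [0.05, 0.12, 0.07, 0.00, 0.08],
    4: [0.05, 0.10, 0.06, 0.09],
    5: [0.07, 0.09, 0.10],
    6: [0.06, 0.12, 0.04, 0.03],
}

sum_sq = sum(d * d for devs in deviations.values() for d in devs)
n_total = sum(len(devs) for devs in deviations.values())
n_sets = len(deviations)

# One degree of freedom is lost for each set of data.
s_pooled = sqrt(sum_sq / (n_total - n_sets))

print(f"sum of squares = {sum_sq:.4f}, degrees of freedom = {n_total - n_sets}")
print(f"s_pooled = {s_pooled:.3f} % (w/v)")
```

With 23 observations in 6 sets there are 17 degrees of freedom, so s_pooled = sqrt(0.1326/17), about 0.088% (w/v).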

Two alternative methods for measuring the precision of a set of results:
VARIANCE: this is the square of the standard deviation:
$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$
COEFFICIENT OF VARIATION (CV) (or RELATIVE STANDARD DEVIATION): divide the standard deviation by the mean value and express as a percentage:
$$CV = \frac{s}{\bar{x}} \times 100\%$$

Use of Statistics in Data Evaluation

How can we relate the observed mean value (x̄) to the true mean (μ)?

The latter can never be known exactly.
The range of uncertainty depends on how closely s corresponds to σ.
We can calculate the limits (above and below) around x̄ within which μ must lie, with a given degree of probability.

Define some terms:
CONFIDENCE LIMITS: define an interval around the mean that probably contains μ.
CONFIDENCE INTERVAL: the magnitude of the confidence limits.
CONFIDENCE LEVEL: fixes the level of probability that the mean is within the confidence limits.
Examples later. First assume that the known s is a good approximation to σ.

Percentages of area under Gaussian curves between certain limits of z (= (x - μ)/σ)
50% of area lies between ±0.67σ
80% ±1.29σ
90% ±1.64σ
95% ±1.96σ
99% ±2.58σ
What this means, for example, is that 80 times out of 100 the true mean will lie within ±1.29σ of any measurement we make.
Thus, at a confidence level of 80%, the confidence limits are ±1.29σ.
For a single measurement: CL for μ = x ± zσ (values of z on next overhead)
For the sample mean of N measurements (x̄), the equivalent expression is:
$$\text{CL for } \mu = \bar{x} \pm \frac{z\sigma}{\sqrt{N}}$$

Values of z for determining Confidence Limits
Confidence level, %   z
50                    0.67
68                    1.0
80                    1.29
90                    1.64
95                    1.96
96                    2.00
99                    2.58
99.7                  3.00
99.9                  3.29
Note: these figures assume that an excellent approximation to the real standard deviation is known.

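Confidence Limits when σ is known
Atomic absorption analysis for copper concentration in aircraft engine oil gave a value of 8.53 mg Cu/ml. Pooled results of many analyses showed s → σ = 0.32 mg Cu/ml.
Calculate 90% and 99% confidence limits if the above result were based on (a) 1, (b) 4, (c) 16 measurements.
(a) N = 1: 90% CL = 8.53 ± (1.64)(0.32)/√1 = 8.53 ± 0.52 mg/ml; 99% CL = 8.53 ± (2.58)(0.32)/√1 = 8.53 ± 0.83 mg/ml
(b) N = 4: 90% CL = 8.53 ± (1.64)(0.32)/√4 = 8.53 ± 0.26 mg/ml; 99% CL = 8.53 ± (2.58)(0.32)/√4 = 8.53 ± 0.41 mg/ml
(c) N = 16: 90% CL = 8.53 ± (1.64)(0.32)/√16 = 8.53 ± 0.13 mg/ml; 99% CL = 8.53 ± (2.58)(0.32)/√16 = 8.53 ± 0.21 mg/ml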
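The copper example can be sketched in Python using CL = x̄ ± zσ/√N, with z taken from the table of z values (1.64 at 90%, 2.58 at 99%). The function and dictionary names are illustrative.

```python
from math import sqrt

# z values from the table of confidence levels.
Z = {90: 1.64, 99: 2.58}


def half_width_z(sigma, n, level):
    """Half-width of the confidence interval when sigma is known: z * sigma / sqrt(N)."""
    return Z[level] * sigma / sqrt(n)


x_bar = 8.53   # measured copper concentration, mg Cu/ml
sigma = 0.32   # pooled standard deviation, taken as a good estimate of sigma

for n in (1, 4, 16):
    for level in (90, 99):
        hw = half_width_z(sigma, n, level)
        print(f"N = {n:2d}, {level}% CL = {x_bar:.2f} +/- {hw:.2f} mg/ml")
```

Quadrupling the number of measurements halves the width of the interval, since the half-width scales as 1/√N.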

If we have no information on σ, and only have a value for s, the confidence interval is larger, i.e. there is a greater uncertainty.
Instead of z, it is necessary to use the parameter t, defined as follows:
t = (x - μ)/s
i.e. just like z, but using s instead of σ.
By analogy we have:
$$\text{CL for } \mu = \bar{x} \pm \frac{ts}{\sqrt{N}}$$
The calculated values of t are given on the next overhead.

Values of t for various levels of probability
Degrees of freedom (N-1)   80%    90%    95%    99%
1                          3.08   6.31   12.7   63.7
2                          1.89   2.92   4.30   9.92
3                          1.64   2.35   3.18   5.84
4                          1.53   2.13   2.78   4.60
5                          1.48   2.02   2.57   4.03
6                          1.44   1.94   2.45   3.71
7                          1.42   1.90   2.36   3.50
8                          1.40   1.86   2.31   3.36
9                          1.38   1.83   2.26   3.25
19                         1.33   1.73   2.10   2.88
59                         1.30   1.67   2.00   2.66
∞                          1.29   1.64   1.96   2.58
Note:
(1) As (N-1) → ∞, t → z.
(2) For all values of (N-1) < ∞, t > z, i.e. greater uncertainty.

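Confidence Limits where σ is not known
Analysis of an insecticide gave the following values for % of the chemical lindane: 7.47, 6.98, 7.27. Calculate the CL for the mean value at the 90% confidence level.
x_i (%)   x_i^2
7.47      55.8009
6.98      48.7204
7.27      52.8529
Σx_i = 21.72, Σx_i^2 = 157.3742
$$\bar{x} = \frac{21.72}{3} = 7.24\%$$
$$s = \sqrt{\frac{157.3742 - (21.72)^2/3}{3 - 1}} = 0.2464\%$$
For 2 degrees of freedom at the 90% level, t = 2.92, so:
$$\text{90\% CL} = \bar{x} \pm \frac{ts}{\sqrt{N}} = 7.24 \pm \frac{(2.92)(0.2464)}{\sqrt{3}} = 7.24 \pm 0.42\%$$
If repeated analyses showed that s → σ = 0.28%:
$$\text{90\% CL} = \bar{x} \pm \frac{z\sigma}{\sqrt{N}} = 7.24 \pm \frac{(1.64)(0.28)}{\sqrt{3}} = 7.24 \pm 0.27\%$$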
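The lindane example can be sketched in Python. The t and z values are copied from the tables (t = 2.92 for 2 degrees of freedom at 90%; z = 1.64 at 90%); the constant names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

lindane = [7.47, 6.98, 7.27]   # % lindane, three replicate analyses

T_90_2DF = 2.92   # t for N - 1 = 2 degrees of freedom, 90% level (from the t table)
Z_90 = 1.64       # z at the 90% level (from the z table)
SIGMA = 0.28      # % lindane, used only if s is known to approximate sigma

n = len(lindane)
x_bar = mean(lindane)
s = stdev(lindane)   # sample standard deviation, N - 1 in the denominator

half_t = T_90_2DF * s / sqrt(n)    # sigma unknown: use t and s
half_z = Z_90 * SIGMA / sqrt(n)    # sigma known: use z and sigma

print(f"mean = {x_bar:.2f} %, s = {s:.3f} %")
print(f"90% CL (sigma unknown) = {x_bar:.2f} +/- {half_t:.2f} %")
print(f"90% CL (sigma known)   = {x_bar:.2f} +/- {half_z:.2f} %")
```

The interval based on t is wider than the one based on z, reflecting the extra uncertainty of estimating the standard deviation from only three results.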

Testing a Hypothesis
Carry out measurements on an accurately known standard.
The experimental value is different from the true value.
Is the difference due to a systematic error (bias) in the method, or simply to random error?
Assume that there is no bias (NULL HYPOTHESIS), and calculate the probability that the experimental error is due to random errors.
The figure shows (A) the curve for the true value (μ_A = μ_t) and (B) the experimental curve (μ_B).


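Bias = μ_B - μ_A = μ_B - x_t.
Remember that the confidence limit for μ (assumed to be x_t, i.e. assume no bias) is given by:
$$\mu = \bar{x} \pm \frac{ts}{\sqrt{N}}$$
so that, in the absence of bias:
$$\bar{x} - x_t = \pm \frac{ts}{\sqrt{N}}$$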

Detection of Systematic Error (Bias)
A standard material known to contain 38.9% Hg was analysed by atomic absorption spectroscopy. The results were 38.9%, 37.4% and 37.1%. At the 95% confidence level, is there any evidence for a systematic error in the method?
$$\bar{x} = 37.8\%, \qquad \bar{x} - x_t = -1.1\%$$
Assume the null hypothesis (no bias). Only reject this if:
$$|\bar{x} - x_t| > \frac{ts}{\sqrt{N}}$$
But t (from the table, 2 degrees of freedom) = 4.30, s (calculated from the three results) = 0.964% and N = 3, so:
$$\frac{ts}{\sqrt{N}} = \frac{(4.30)(0.964)}{\sqrt{3}} = 2.39\%$$
Since 1.1% < 2.39%, the null hypothesis is maintained, and there is no evidence for systematic error at the 95% confidence level.
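A sketch of the bias test in Python, applied to the three Hg results as listed. The value of t is copied from the t table (2 degrees of freedom, 95% level), and the names are illustrative. The three results as listed give s of about 0.964%.

```python
from math import sqrt
from statistics import mean, stdev

hg = [38.9, 37.4, 37.1]   # % Hg found in the three analyses
x_true = 38.9             # % Hg in the standard material
T_95_2DF = 4.30           # t for N - 1 = 2 degrees of freedom, 95% level

n = len(hg)
x_bar = mean(hg)
s = stdev(hg)                    # sample standard deviation
difference = x_bar - x_true      # observed difference from the true value
limit = T_95_2DF * s / sqrt(n)   # largest difference explicable by random error

# Reject the null hypothesis (no bias) only if the difference exceeds the limit.
bias_detected = abs(difference) > limit

print(f"mean = {x_bar:.1f} %, s = {s:.3f} %")
print(f"|difference| = {abs(difference):.2f} %, limit = {limit:.2f} %")
print("bias detected" if bias_detected else "no evidence of bias at the 95% level")
```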

Are two sets of measurements significantly different?
Suppose two samples are analysed under identical conditions. Are these significantly different?
Using the definition of pooled standard deviation, the equation on the last overhead can be re-arranged:
$$\bar{x}_1 - \bar{x}_2 = \pm t\, s_{pooled} \sqrt{\frac{N_1 + N_2}{N_1 N_2}}$$
Only if the difference between the two samples is greater than the term on the right-hand side can we assume a real difference between the samples.

Test for significant difference between two sets of data
Two different methods for the analysis of boron in plant samples gave the following results (mg/g):
[results by spectrophotometry]
[results by fluorimetry]
Each is based on 5 replicate measurements.
At the 99% confidence level, are the mean values significantly different?
Calculate s_pooled = 0.267. There are 8 degrees of freedom, therefore (from the table) t = 3.36 (99% level).
The level for rejecting the null hypothesis is:
$$\pm t\, s_{pooled} \sqrt{\frac{N_1 + N_2}{N_1 N_2}} = \pm (3.36)(0.267) \sqrt{\frac{10}{25}}$$
i.e. ±0.5674, or ±0.57 mg/g.
The difference between the two means exceeds this level. Therefore, at this confidence level, there is a significant difference, and there must be a systematic error in at least one of the methods of analysis.
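The rejection level in the boron example can be reproduced with a few lines of Python, using the quoted values s_pooled = 0.267, t = 3.36 and N_1 = N_2 = 5. The function name is hypothetical.

```python
from math import sqrt


def min_significant_difference(t, s_pooled, n1, n2):
    """Smallest difference between two means that is significant:
    t * s_pooled * sqrt((N1 + N2) / (N1 * N2))."""
    return t * s_pooled * sqrt((n1 + n2) / (n1 * n2))


threshold = min_significant_difference(t=3.36, s_pooled=0.267, n1=5, n2=5)
print(f"reject the null hypothesis if |mean1 - mean2| > {threshold:.4f} mg/g")
```

The degrees of freedom used to look up t are N_1 + N_2 - 2 = 8, because one degree of freedom is lost for each set of data.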

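Detection of Gross Errors
A set of results may contain an outlying result, out of line with the others. Should it be retained or rejected? There is no universal criterion for deciding this. One rule that can give guidance is the Q test.
Consider a set of results arranged in order of size, with x_q the questionable result and x_n its nearest neighbour.
The parameter Q_exp is defined as follows:
$$Q_{exp} = \frac{|x_q - x_n|}{w}$$
where w is the spread of the whole set (largest value minus smallest value).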


Q_exp is then compared to a set of values Q_crit:
Rejection of the outlier is recommended if Q_exp > Q_crit for the desired confidence level.
Note:
1. The higher the confidence level, the less likely is rejection to be recommended.
2. Rejection of outliers can have a marked effect on the mean and standard deviation, especially when there are only a few data points. Always try to obtain more data.
3. If outliers are to be retained, it is often better to report the median value rather than the mean.

Q_crit (reject if Q_exp > Q_crit)
No. of observations   90%     95%     99% confidence level
3                     0.941   0.970   0.994
4                     0.765   0.829   0.926
5                     0.642   0.710   0.821
6                     0.560   0.625   0.740
7                     0.507   0.568   0.680
8                     0.468   0.526   0.634
9                     0.437   0.493   0.598
10                    0.412   0.466   0.568

Q Test for Rejection of Outliers
The following values were obtained for the concentration of nitrite ions in a sample of river water: 0.403, 0.410, 0.401, 0.380 mg/l. Should the last reading be rejected?
$$Q_{exp} = \frac{|0.380 - 0.401|}{0.410 - 0.380} = 0.70$$
But Q_crit = 0.829 (at 95% level) for 4 values.
Therefore, Q_exp < Q_crit, and we cannot reject the suspect value.
Suppose 3 further measurements are taken, giving total values of: 0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.411 mg/l. Should 0.380 still be retained?
$$Q_{exp} = \frac{|0.380 - 0.400|}{0.413 - 0.380} = 0.61$$
But Q_crit = 0.568 (at 95% level) for 7 values.
Therefore, Q_exp > Q_crit, and rejection of 0.380 is recommended.
But note that 5 times in 100 it will be wrong to reject this suspect value!
Also note that if 0.380 is retained, s = 0.011 mg/l, but if it is rejected, s = 0.0056 mg/l, i.e. precision appears to be twice as good, just by rejecting one value.
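A sketch of the Q test in Python. It assumes the usual definition of Q_exp as the gap between the suspect value and its nearest neighbour divided by the spread of the whole set; the function name q_exp and the dictionary Q_CRIT_95 (the 95% column of the Q_crit table) are illustrative.

```python
from statistics import stdev

# 95% column of the Q_crit table, keyed by number of observations.
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}


def q_exp(values, suspect):
    """Gap between the suspect value and its nearest neighbour, divided by the spread."""
    others = list(values)
    others.remove(suspect)                        # drop one occurrence of the suspect
    gap = min(abs(suspect - v) for v in others)   # distance to nearest neighbour
    spread = max(values) - min(values)
    return gap / spread


first_four = [0.403, 0.410, 0.401, 0.380]
all_seven = first_four + [0.400, 0.413, 0.411]

for data in (first_four, all_seven):
    q = q_exp(data, 0.380)
    verdict = "reject" if q > Q_CRIT_95[len(data)] else "retain"
    print(f"N = {len(data)}: Q_exp = {q:.2f}, Q_crit = {Q_CRIT_95[len(data)]:.3f} -> {verdict}")

# Effect of rejection on the apparent precision of the seven results.
s_retained = stdev(all_seven)
s_rejected = stdev([v for v in all_seven if v != 0.380])
print(f"s with 0.380 = {s_retained:.3f} mg/l, without = {s_rejected:.4f} mg/l")
```

With seven values the nearest neighbour of 0.380 is the new reading 0.400, so the gap is 0.020 rather than 0.021.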

Obtaining a representative sample
Homogeneous gaseous or liquid sample
No problem - any sample is representative.
Solid sample - no gross heterogeneity
Take a number of small samples at random from throughout the bulk - this will give a suitable representative sample.
Solid sample - obvious heterogeneity
Take small samples from each homogeneous region and mix these in the same proportions as between each region and the whole.
If it is suspected, but not certain, that a bulk material is heterogeneous, then it is necessary to grind the sample to a fine powder, and mix this very thoroughly before taking random samples from the bulk.
For a very large sample - a train-load of metal ore, or soil in a field - it is always necessary to take a large number of random samples from throughout the whole.

Sample Preparation and Extraction
There may be many analytes present, which need separation (see later).
There may be small amounts of analyte(s) in bulk material. These need to be concentrated before analysis, e.g. heavy metals in animal tissue, additives in polymers, herbicide residues in flour, etc.
It may be helpful to concentrate complex mixtures selectively.
Most general type of pre-treatment: EXTRACTION.

Classical extraction method is: SOXHLET EXTRACTION (named after its developer).
Apparatus: sample in a porous thimble.
Exhaustive reflux for up to 1 - 2 days.
Gives a solution of analyte(s) in a volatile solvent (e.g. CH2Cl2, CHCl3 etc.).
Evaporate to dryness or to a suitable concentration, for separation/analysis.
