
Page 1: With thanks to Linda M Thienpont Linda.thienpont@ugent.be Kristian Linnet, MD, PhD Linnet@post7.tele.dk Per Hyltoft Petersen, MSc Per.hyltoft.petersen@ouh.fyns-amt.dk

With thanks to

• Linda M Thienpont, Linda.thienpont@ugent.be
• Kristian Linnet, MD, PhD, Linnet@post7.tele.dk
• Per Hyltoft Petersen, MSc, Per.hyltoft.petersen@ouh.fyns-amt.dk
• Sverre Sandberg, MD, PhD

Laboratory Statistics & Graphics with EXCEL®

Tutorial book

Dietmar Stöckl, Dietmar@stt-consulting.com

Statistics & graphics for the laboratory 2

STT Consulting
Dietmar Stöckl, PhD
Abraham Hansstraat 11
B-9667 Horebeke, Belgium
e-mail: Dietmar@stt-consulting.com


Content overview

How to use this book

How to use the EXCEL-files

Univariate Statistics

• Data, data presentation, and data description

• Gaussian (or Normal) distribution

• Tests for normality & calculations with logarithms

• Sampling statistics: Confidence intervals

• Estimation and hypothesis testing (F-test, Chi2-test, t-tests, outliers)

• Analysis of variance (ANOVA)

• Statistical power concept and sample size

Bivariate Statistics

• Graphical techniques

• Combined graphical/statistical techniques

• Correlation

• Regression

Annex


EXCEL-files

General
• DataGeneration
• StatFunctions&Tables

Data, -presentation & -description
• Datasets
• Data&DataPresentation
• Graphs-EXCEL
• ProbPlots

Gaussian distribution
• Exercises-BasicStats
• Gaussian Distribution
• NormalRankitPlot

Sampling statistics
• SamplingStatistics
• CI-Calculator

Estimation & hypothesis testing and confidence intervals
• CI&NHST-EXCEL
• CI&NHST
• CI&NHST-Exercise
• Grubbs: (http://www.graphpad.com/articles/outlier.htm)

ANOVA
• Cochran&Bartlett
• ANOVA

Power & sample size
• Power

Graphical techniques
• GraphBivariate-EXCEL

Combined graphical/statistical techniques
• Bland&Altman

Correlation & regression
• Correlation&Regression
• CorrRegr-EXCEL


Detailed content

Univariate statistics

Data
• Types of data & types of statistics
• Exemplary laboratory data
  – "Repeated weighing" experiment
  – Adult serum triacylglycerides

Data presentation (univariate data)
• Importance of digits
• Table
• Graphics with EXCEL®
  – Dot-plot
  – Histogram
  – Frequency polygon
  – Dynamic histogram
  – Box and whisker plot
• Time-indexed plots

Data description
• Descriptive statistics
  – Location: mean, median, mode
  – Dispersion: range, variance, standard deviation, coefficient of variation
• Equations
• Descriptive statistics with EXCEL®
• Importance of digits

Gaussian (or Normal) distribution
• "Bell-shaped" (similar to a histogram)
• Cumulated: "S-shaped"
• Cumulated & linearized
• 2-sided and 1-sided probabilities
• Inside/outside probabilities
• Probabilities at selected s (z) values
• Deviation from normality (skewness & kurtosis)

Tests for normality

Calculations with logarithms


Sampling statistics: Confidence intervals
• t-distribution (distribution of means)
  – Confidence interval of a mean
  – Confidence interval of the "1.96 s-limit"
• Chi2 (χ²)-distribution (distribution of variances)
  – Confidence interval of a standard deviation
• Interpretation of confidence limits

Estimation and hypothesis testing (F-test, Chi2-test, t-tests)
• Introduction
• t-tests
• Outlier tests (k · SD, Grubbs, Dixon's Q)
• F-test, χ² (= Chi2)-test
• Tests and confidence limits

Analysis of Variance (ANOVA)
• Introduction
• Model I ANOVA
  Performance strategy
  – Testing of outliers
  – Testing of variances (Cochran "C", Bartlett)
• Model II ANOVA
  – Applications

Power and sample size

Bivariate Statistics

Graphical techniques
• Scatter-plot
• Difference-plot
• Residual-plot
• Krouwer-plot
• Influences on the plots (data-range; subgroups; outliers; scaling)
• Influences of random and systematic errors on the plots
• Linearity
• Specifications in plots


Bivariate statistics (ctd.)

Combined graphical/statistical techniques
• The Bland & Altman approach

Correlation
• The statistical model
• Correlation in method comparison
• Non-parametric correlation

Regression
• Ordinary linear regression (OLR)
• Deming regression
• Passing-Bablok regression (non-parametric)
• Weighted regression
• Regression & method comparison
• Regression & calibration

Annex

Statistics with EXCEL®
• EXCEL® installation requirements
• Tips for EXCEL®-graphics

Statistical resources
• Web resources
  – Glossary of statistical terms
  – Interesting educational resources
• Statistical software
  – General
  – "Laboratory statistics"
• Books

Statistical tables

Presenter's publications & courses related to the topic


Overview

Basics

Data
• Quantitative
• Categorical

Statistics
• Exploratory Data Analysis
• Parametric
• Non-parametric
• Bayesian

Data presentation (univariate)
• [Importance] Digits
• Table
• Dot-plot
• Histogram/Frequency polygon
• Frequency cumulated
• Normal probability plot (Rankit)
• Krouwer (mountain) plot
• Box & whisker plot
• Time-indexed plot

Data presentation (bivariate)
• 2 x 2 Table
• Scatter plot
• Residual plot
• Bias plot
• Bland & Altman plot/approach

Data transformation
• Logarithms

Other
• Variance pooling
• Variance propagation
• Total error calculations


Descriptive statistics
• Location
  – Mean
  – Median
  – Mode
• Dispersion
  – Minimum/Maximum
  – Range; quartile/quantile
  – Variance
  – Standard deviation (SD, or s)
  – Coefficient of variation (CV)
• Graphics
  – >Data presentation (univariate)

Gaussian distribution
• z-value
• Probabilities
  – 2-sided
  – 1-sided
  – Inside
  – Outside
  – At selected z-values
• Deviations
  – Skewness
  – Kurtosis

Sampling statistics
• Central limit theorem
• Confidence intervals
  – t-distribution: Conf. interval of a mean; Conf. interval of a centile (1.96)
  – Chi-square (χ²) distribution: Conf. interval of a SD


Significance testing
• Means (n > 2: ANOVA)
  – 1-sample t-test (non-parametric: Wilcoxon signed rank)
  – t-test (non-parametric: Mann-Whitney U)
  – Paired t-test (non-parametric: Wilcoxon signed rank)
• SDs
  – 1-sample F-test (Chi-square~)
  – F-test
• Variances (n > 2)
  – Cochran "C"
  – Bartlett
• Distribution
  – Chi-square
  – Kolmogorov Smirnov (non-parametric)
  – Anderson Darling (non-parametric)
• Outlier
  – Grubbs test
  – Dixon's Q-test

ANOVA
• One-way (non-parametric: Kruskal-Wallis)
• Model I versus Model II
  – Model I: Significance testing (means n > 2)
  – Model II: Variance estimation

Correlation (r)
• Pearson (non-parametric: Spearman, Kendall)

Regression
• Ordinary linear regression
• Deming regression
• Passing Bablok regression (non-parametric)
• Weighted regression forms

Power
• Sample size calculation


General considerations to approach data

Frequent statistical questions
• Which kind of data?
• Which kind of distribution?
• Is there a difference?
• Was there a change?
• Is there an association?
• What is the probability?

Which kind of data
• Quantitative
  – Measured
  – Counted
• Categorical
  – Ordinal
  – Nominal

Data collection/Kind of experiment
• Retrospective
• Experience
• Statistically planned

• Sufficient digits
• Sample size calculations

Approach for quantitative, measured data

Plot data
• Outliers
• Distribution (n > 20)

Appropriate statistic
• Parametric, direct or
  – Remove outliers
  – Transform data
• Non-parametric
  – [Remove outliers]

Which question
• Description
• Difference
• Change
• Association
• Prediction
• Sample size

• Selection of plot
• Selection of test
• Selection of probability


Summary of significance tests

Problem | Parametric | Non-parametric | Graphic
Outlier | Grubbs; Dixon's Q | | Dot-plot
Distribution | CHI2 | Anderson-Darling (recommended); Kolmogorov-Smirnov | Normal probability plot (= Rankit-plot)
Mean$ vs target | t-test, 1-sample | Wilcoxon signed rank | Confidence interval (CI)
2 means$ | t-test (equal & unequal variances): perform F-test before | Mann-Whitney U | CI
Paired means$ (change) | Paired t-test | Wilcoxon signed rank | CI
SD/VAR vs target | CHI2 | | CI
2 SDs/VAR | F-test | Siegel-Tukey | CI
>2 means$ | ANOVA | Kruskal-Wallis |
>2 variances | Cochran's C; Bartlett | | Rankit-plot with CHI2-function
Association | Pearson correlation | Spearman or Kendall |
Prediction | Regression | Passing-Bablok regression |

$: or median; SD = standard deviation; VAR = variance; vs = versus

Summary of graphics

Univariate data | Bivariate data
Dot-plot | Scatter plot
Histogram/Frequency polygon | Difference (bias) plots
Cumulated frequency plot | Residual plot
Krouwer plot (folded cum. frequency) | [Contingency tables]
Normal probability (Rankit) plot |
Box & Whisker plot |
Run-sequence plot (Control charts) |
Lag-plot |


Selected analytical problems and associated statistics

Analytical problem | Associated statistics$

Method evaluation/validation (in-house)
General | Basic statistics; outlier tests (e.g., Grubbs)
Imprecision | F-test; CHI2-test (#), ANOVA
Limit of detection | Probability & Power
Linearity | Regression, ANOVA
Calibration | Regression & correlation
Sample trueness/bias (recovery) | t-tests (#)
Accuracy (uncertainty) of result | Variance propagation
Method comparison | Regression & correlation
Trouble-shooting | Power (sample size calculations)

Collaborative trials (n > 2 laboratories)
Imprecision | Cochran C; Bartlett
Bias | ANOVA, Model I
Estimation of variance | ANOVA, Model II

Interpretation of analytical results
Depends on the problem | Various (see above); tests for distribution; data transformation; Bayesian statistics

$ Pure measurement variation is usually assumed to be Gaussian
# Alternative: confidence intervals


How to use this book

This book is an introductory text to basic statistical and graphical techniques used in the analytical laboratory.

It is accompanied by EXCEL-files that should facilitate self-education by
- demonstrating the statistical & graphical possibilities of EXCEL
- explaining statistical concepts with dynamic worksheets
- providing examples for creating user-specific templates

The use of the EXCEL-files is indicated by the following icons:

The general layout is shown in the figure below.


How to use the EXCEL-files

EXCEL-Settings
• The files have been tested with EXCEL 2000 & EXCEL XP.
• Activate the AddIns: Analysis ToolPak & Analysis ToolPak - VBA.
• Macro security: Medium or low.
• When opening the files, choose "Enable Macros".
• The nicest view is in the "Full Screen" mode.
Notes: Make a back-up with a different name; do not save changes.

Features of the EXCEL-files
- Easy "click-through" navigation between the worksheets
- Information icon: gives information about the intention of the file
- Note icon: draws attention to particular EXCEL or other issues
- Exercise icon: gives instructions for interactive worksheets; additionally, the worksheets give detailed information on how to perform certain exercises
- Comment-cell: contains important information on specific topics
- Many files contain dynamic elements for user interaction


CAVE
Please close other applications. During extensive use with Windows 98, it may be necessary to delete the Windows>Temp files every now and then (otherwise, EXCEL may shut down).

The EXCEL-files will guide the user through the statistical functions of EXCEL that are available through the fx-icon and the "Data Analysis" AddIn. A summary of the statistical functions of EXCEL is given in the file EXCEL-StatFunctions.

The file StatTables-EXCEL contains statistical tables that are created with the EXCEL-functions
• NORMSINV (z-table)
• FINV (F-table)
• TINV (t-table)
• CHIINV (Chi2-table)
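For readers who want to cross-check the z-table outside EXCEL, the following is a minimal sketch in Python. It assumes that NORMSINV(p) returns the standard normal quantile for a cumulative probability p and reproduces that with `NormalDist.inv_cdf` from the standard library. The function name `z_table` and the chosen probabilities are illustrative only and do not come from the EXCEL-files. The F-, t-, and Chi2-tables are not covered, because the standard library has no inverse functions for those distributions.

```python
from statistics import NormalDist


def z_table(probabilities):
    """Return (p, z) pairs; z is the standard normal quantile for cumulative probability p."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [(p, std_normal.inv_cdf(p)) for p in probabilities]


if __name__ == "__main__":
    # Illustrative cumulative probabilities (not taken from the EXCEL-files)
    for p, z in z_table([0.90, 0.95, 0.975, 0.995]):
        print(f"p = {p:.3f}  z = {z:.3f}")
```

The value for p = 0.975 is the familiar 1.96 used for 2-sided 95% limits.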

The EXCEL-files are of
• tutorial nature (explaining the statistical concepts)
• practical nature (templates for use)

Legal notice
The EXCEL-files are for educational purposes. They should not be regarded as commercial software. They have been prepared with the utmost care, but it cannot be excluded that they contain errors. The author is not liable for errors.


Data

• Types of data & types of statistics
• Exemplary laboratory data
  – "Repeated weighing" experiment
  – Adult serum triacylglycerides

Data presentation (univariate data)

• Importance of digits
• Table
• Graphics with EXCEL®
  – Dot-plot
  – Histogram
  – Frequency polygon
  – Dynamic histogram
  – Box and whisker plot
• Time-indexed plots

Data, data presentation & data description

Datasets; Data&DataPresentation


Types of data & types of statistics

Types of data
To correctly apply statistical techniques, we first have to understand the type of data we are dealing with (see table below).

QUANTITATIVE ("numerical")
Measured ("continuous")$ | Counted ("discrete")
• Blood pressure | Number of
• Height | • … children in a family
• Weight | • … cases of AIDS in a city

CATEGORICAL
Ordinal ("ranked"; ordered categories, usually based on a measure) | Nominal (unordered categories)
• Grade of cancer | • Sex (male/female)
• Better, same, worse | • Alive/Dead
• Disagree, neutral, agree | • Blood group

$ May be converted to nominal by "cutoffs" (normotension; hypertension)
Source: Statistics at square one (10th ed). Swinscow, Campbell. BMJ Books, 2002

Types of statistics
When we know which type of data we are dealing with, we still have to know (or make assumptions) about the probability distribution of the data to apply the correct type of statistics (>Parametric-/>Non-parametric statistics; >Bayesian statistics). Identification of the type of distribution can be done with graphical techniques (>Exploratory Data Analysis) and formal statistical testing.

Parametric statistics
Parametric methods for statistical hypothesis testing assume that the distributions of the variables being assessed have certain characteristics (usually, a "normal" distribution is assumed).

The basic assumption of normality rests on the idea that many independent, additive factors are responsible for the dispersion. Parametric techniques usually involve squared measures, e.g. the standard deviation is computed from sums of squared deviations from the mean.

Squaring is justified by properties of the normal distribution, which make it the optimal (most efficient) estimation technique.

With real distributions, however, squaring makes parametric approaches sensitive to the presence of outliers.

Testing for outliers should always be considered in parametric testing.
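The sensitivity of squared measures to outliers can be illustrated with a small sketch. The values below are hypothetical (they are not one of the book's data-sets), and the comparison with the median and the interquartile range is added here only as a rank-based contrast to mean and standard deviation.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return squared-measure statistics (mean, SD) and rank-based ones (median, IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


# Hypothetical weighings (mg); the second set has one gross error in the last value
clean = [99.8, 99.9, 100.0, 100.0, 100.1, 100.2]
with_outlier = clean[:-1] + [110.2]

if __name__ == "__main__":
    print("clean       :", summarize(clean))
    print("with outlier:", summarize(with_outlier))
```

In this example the single outlier inflates the standard deviation many times over, while the median and the interquartile range stay unchanged.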


Non-parametric statistics
Non-parametric (or distribution-free) methods for statistical hypothesis testing make no assumptions about the frequency distributions of the variables being assessed.

Bayesian statistics
Statistics that incorporate prior knowledge and accumulated experience into probability calculations; they use subjective probability as a starting point for assessing a subsequent probability.

Exploratory Data Analysis
Exploratory data analysis is a term used to describe a group of techniques (largely graphical in nature) that shed light on the structure of the data.

Without this knowledge the scientist, or anyone else, cannot be sure they are using the correct form of statistical evaluation.

Before applying statistics, data should be plotted.

Data types & typical statistics
Often, certain types of data are related to certain types of statistics. Some of the most common cases are presented below.

Quantitative continuous data (parametric)
• Descriptive statistics and confidence interval of a mean.
• Confidence interval of a standard deviation.
• Grubbs' test to detect an outlier.
• t-test to compare two means.
• Analysis of variance (ANOVA).

Ranked data (non-parametric)
• Mann-Whitney U
• Kruskal-Wallis one-way ANOVA
• Wilcoxon signed ranks
• Sign test
• Kolmogorov Smirnov

Categorical data
• Chi-square (compare observed and expected frequencies).
• Binomial and sign test (compare observed and expected proportions).
• Fisher's and chi-square (analyze a 2x2 contingency table).
• Predictive values from sensitivity, specificity, and prevalence.
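The last item, predictive values from sensitivity, specificity, and prevalence, can be sketched in a few lines. This is the standard application of Bayes' theorem to a 2x2 table, written here as an illustration; the function name and the example numbers are assumptions and are not taken from the EXCEL-files.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (positive predictive value, negative predictive value) as fractions."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical test: 90% sensitivity, 90% specificity, 10% prevalence
    ppv, npv = predictive_values(0.90, 0.90, 0.10)
    print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With these hypothetical numbers, only half of the positive results are true positives, which shows how strongly the predictive value depends on prevalence.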


Exemplary laboratory data

In the laboratory, we mainly deal with measured data. The distribution of these data can be dominated by the laboratory manipulation itself (e.g., pipetting, repeated measurement of the same sample) or by the analyte (e.g., biological variation). In the first part of the course, 2 data-sets will be given that represent these 2 cases (weighing; biological variation of serum triacylglycerides).

Data-set 1 (weighing): Gravimetric control of a pipetted volume
Data-set 2: Serum triacylglycerides in adult males

The experiment (data-set 1)
• Pipet: 20-200 µl, variable
• Pipetted nominal volume: 100 µl
• 21 pipettings (n = 21)
• Balance: readability 0.01 mg

Other data-sets
Other data-sets can be found in the file "Datasets". They are used elsewhere in the book.

Creation of data-sets
The file "DataGeneration" explains how to generate
• Random, univariate data
• Bivariate data with constant SD
• Bivariate data with constant CV
• log-Normal distributed data

It should be used after the book has been worked through.
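The four kinds of generated data listed for the file "DataGeneration" can be sketched as follows. This is an assumption-laden illustration, not the method used in the EXCEL-file: all function names are hypothetical, the bivariate data assume the simple model y = x + Gaussian error, and "constant CV" is read as an error SD proportional to x.

```python
import math
import random


def univariate(n, mean, sd, rng):
    """n Gaussian random values with the given mean and SD."""
    return [rng.gauss(mean, sd) for _ in range(n)]


def bivariate_constant_sd(x_values, sd, rng):
    """(x, y) pairs with y = x + error; the error SD is the same at every x."""
    return [(x, x + rng.gauss(0.0, sd)) for x in x_values]


def bivariate_constant_cv(x_values, cv, rng):
    """(x, y) pairs with y = x + error; the error SD is cv * x (constant CV)."""
    return [(x, x + rng.gauss(0.0, cv * x)) for x in x_values]


def log_normal(n, mu_log, sigma_log, rng):
    """n values whose natural logarithm is Gaussian(mu_log, sigma_log)."""
    return [math.exp(rng.gauss(mu_log, sigma_log)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the example is repeatable
    print(univariate(5, 100.0, 0.3, rng))
    print(bivariate_constant_cv([50, 100, 200], 0.05, rng))
    print(log_normal(5, 0.0, 0.5, rng))
```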


Importance of digits

It is important to know that the quality of our data may depend on the number of reported digits.

Adapting digits with EXCEL®

1st option
• The decrease/increase decimals icon decreases the decimals visually, but EXCEL still uses them for calculations.

2nd option
• Tools>Options>Calculation>Precision as displayed
• Then decrease decimals. The data are rounded and the decimals are lost!
• Afterwards, deactivate the field again!

We round the data of the weighing experiment. Follow the instructions given in the EXCEL-sheet "Digits":
1. Tools>Options>Calculation>Precision as displayed
2. Select the gray cells and reduce by 2 digits
Afterwards, UNCHECK IT!
The rounded data no longer reflect the spread of the original data.

• Report your data with sufficient digits, adapted to measurement precision!

Data&DataPresentation (Worksheet "Digits")

[Figure: histogram of the weighing data; x-axis: Bin (99.00 to 101.00 in steps of 0.25), y-axis: Frequency (0 to 25)]

Weighing | Weight (mg), original | Weight (mg), rounded
1 | 99.92 | 100
2 | 100.23 | 100
3 | 99.50 | 100
… | … | …
10 | 100.39 | 100
11 | 100.23 | 100
12 | 99.25 | 99
13 | 100.28 | 100
14 | 100.25 | 100
… | … | …
19 | 99.83 | 100
20 | 100.05 | 100
21 | 100.22 | 100
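The loss of information through rounding can be checked numerically. The sketch below uses the 21 weighings listed in the worksheet "Dataset" and rounds them to whole milligrams. The helper `round_half_up` is a hypothetical stand-in for EXCEL's rounding (Python's built-in `round` rounds halves to the nearest even number), and it is only approximate for values that are not exactly representable in binary.

```python
import math
from statistics import stdev

# Weights (mg) of the 21 pipettings, in the order of weighing
WEIGHTS_MG = [
    99.92, 100.23, 99.50, 100.27, 100.22, 100.01, 100.18,
    100.04, 99.60, 100.39, 100.23, 99.25, 100.28, 100.25,
    100.44, 99.50, 100.04, 100.13, 99.83, 100.05, 100.22,
]


def round_half_up(value, decimals=0):
    """Round so that a trailing 5 goes up (illustrative helper, positive values only)."""
    factor = 10 ** decimals
    return math.floor(value * factor + 0.5) / factor


rounded = [round_half_up(w) for w in WEIGHTS_MG]
sd_original = stdev(WEIGHTS_MG)
sd_rounded = stdev(rounded)

if __name__ == "__main__":
    print(f"SD of original data: {sd_original:.3f} mg")
    print(f"SD of rounded data : {sd_rounded:.3f} mg")
    print("Distinct rounded values:", sorted(set(rounded)))
```

After rounding, only two distinct values remain, so the rounded data can no longer describe the spread of the weighings.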


Data presentation

When we have created data, it is important to present them in easily comprehensible forms (>Tables; >Graphs). We use the weighing data as an exercise.

Table (weighing experiment)
• Try to describe the data (center, maximum, minimum, etc.).

We note: Tables are difficult to "read"! Sorting may help, but keep the sample number & the result together!

Remark: "Single column data" (such as the weighing data) are also called univariate data.

Sorting by weight
1. Select the gray cells
2. Data>Sort: Follow the "Print Screen"

Data&DataPresentation (Worksheet "Dataset")

Unsorted | | Sorted! |
Weighing | Weight (mg) | Weighing | Weight (mg)
1 | 99.92 | 12 | 99.25
2 | 100.23 | 3 | 99.50
3 | 99.50 | 16 | 99.50
4 | 100.27 | 9 | 99.60
5 | 100.22 | 19 | 99.83
6 | 100.01 | 1 | 99.92
7 | 100.18 | 6 | 100.01
8 | 100.04 | 8 | 100.04
9 | 99.60 | 17 | 100.04
10 | 100.39 | 20 | 100.05
11 | 100.23 | 18 | 100.13
12 | 99.25 | 7 | 100.18
13 | 100.28 | 5 | 100.22
14 | 100.25 | 21 | 100.22
15 | 100.44 | 2 | 100.23
16 | 99.50 | 11 | 100.23
17 | 100.04 | 14 | 100.25
18 | 100.13 | 4 | 100.27
19 | 99.83 | 13 | 100.28
20 | 100.05 | 10 | 100.39
21 | 100.22 | 15 | 100.44

We have seen that tables "are difficult to read". Try a picture (graph).
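Keeping the sample number and the result together during sorting can be sketched outside EXCEL as well. The example below pairs each weight with its weighing number before sorting; the variable names are illustrative. Python's `sorted` is stable, so equal weights keep their original order, which matches the sorted table.

```python
# Weights (mg) in the order of weighing (weighing 1 to 21)
weights_mg = [
    99.92, 100.23, 99.50, 100.27, 100.22, 100.01, 100.18,
    100.04, 99.60, 100.39, 100.23, 99.25, 100.28, 100.25,
    100.44, 99.50, 100.04, 100.13, 99.83, 100.05, 100.22,
]

# Pair each result with its weighing number so that they cannot be separated
records = list(enumerate(weights_mg, start=1))

# Sort by weight only; the weighing number travels with its result
sorted_records = sorted(records, key=lambda record: record[1])

if __name__ == "__main__":
    for number, weight in sorted_records:
        print(f"{number:2d}  {weight:.2f}")
```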


Graphs (Exploratory Data Analysis – "EDA")

Graphs are particularly useful for presenting data in summarized form and for "shedding light" onto the structure of the data.

Most useful for univariate data are >Dot-plots, >Histograms, >Box-and-whisker plots, and the >"Normal probability" plots.

First, these types of plots will be described, followed by the EXCEL-exercises.

Plots for univariate data
Note: the data can also be "derived data" from bivariate data (e.g., differences).

Dot plot
The dot plot presents the distribution of a variable (Yi) (usually on the y-axis) in a category (usually on the x-axis). Data point coordinates are [Category; Yi]. Equal values are usually visualized by an offset.

Use: Visual summary of data and data distribution. The dot plot can show:
• center (i.e., the location) of the data;
• spread (i.e., the scale) of the data;
• skewness of the data;
• presence of outliers;
• presence of multiple modes in the data.

[Figure: dot plot of Sample A; y-axis: Value (0 to 140)]

Histogram (Frequency polygon#)
Histograms (the term was first used by Pearson, 1895) present the frequency distribution of a variable in columns drawn over class intervals (bins). The heights of the columns are proportional to the class frequencies. Coordinates are [Bin center Xi; frequency Yi].
Example of a bin: Midpoint: 85; Range: 80 – 90; Results in range: 2.

Use: Visual summary of data and data distribution. The histogram can show:
• center (i.e., the location) of the data;
• spread (i.e., the scale) of the data;
• skewness of the data;
• presence of outliers;
• presence of multiple modes in the data.

# Frequency polygon: the midpoints of the tops of the columns are connected by a line (columns are not shown). Coordinates example: [55;1], [65;0], [75;0], [85;2], [95;7], etc.

Box & Whisker plot
In box plots (this term was first used by Tukey, 1970), the central tendency (e.g., median or mean) and the range or variation statistics (e.g., quartiles) are computed and presented as a "box". The whiskers outside of the box represent a selected range (e.g., 10% & 90%; here: the full range). Outlier data points can also be plotted. Coordinates are [Category; Particular Y].

Use: Visual summary of data distribution. The box and whisker plot can show:
• center (i.e., the location) of the data;
• spread (i.e., the scale) of the data;
• skewness of the data;
• presence of outliers.
It is particularly useful for detecting and illustrating location and variation changes between different groups of data.
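The [Bin center; frequency] coordinates of a histogram or frequency polygon can be computed as in the sketch below. The data are hypothetical and were chosen only so that the first five coordinates match the example given for the frequency polygon. The rule that a value on a bin boundary belongs to the upper bin is an assumption, since the text does not state which edge is inclusive.

```python
def histogram(values, start, width, n_bins):
    """Return [(bin center, frequency), ...] for n_bins equal-width bins from `start`.

    A value v falls into bin i when start + i*width <= v < start + (i+1)*width.
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    centers = [start + (i + 0.5) * width for i in range(n_bins)]
    return list(zip(centers, counts))


# Hypothetical results, chosen to reproduce the example coordinates
data = [52, 83, 88, 91, 92, 94, 95, 96, 97, 99]

if __name__ == "__main__":
    for center, frequency in histogram(data, start=50, width=10, n_bins=5):
        print(f"[{center:g}; {frequency}]")
```

Plotting the coordinates as columns gives the histogram; connecting them with a line gives the frequency polygon.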

[Figure: histogram; x-axis: Value-Bin (55 to 135 in steps of 10), y-axis: Frequency (0 to 8)]

[Figure: box and whisker plot of Sample A; y-axis: Value (0 to 140)]

Relative cumulative probability plot
Values of a distribution are ordered and the relative cumulative probability of all values up to a certain value is plotted versus that value. Coordinates are [Value i; relative cumulative probability up to & including Value i].

Use: Visual test for Normal distribution: comparison of the data polygon line with the cumulated Gaussian line calculated with the data SD.

Krouwer plot ("folded cumulated probability")
Cumulated probability plot with the y-axis "folded" at probability P = 0.5, or 50% (P = 0.5 or 50% is the maximum y-value). Up to 50%, coordinates are [Value i; cumulative relative probability up to & including Value i]. Above 50%, coordinates are [Value i; 100% minus cumulative relative probability up to & including Value i].

Use: Visual test for Normal distribution: comparison of the data polygon line with a "folded" cumulated Gaussian line calculated with the data SD. Similar to a histogram: visual summary of data and data distribution. The Krouwer plot can show:
• center (i.e., the location) of the data;
• spread (i.e., the scale) of the data;
• skewness of the data;
• presence of outliers.
Special application: Method comparison (note: concentration information is lost).

Normal probability plot
Cumulated probability plot with the y-axis normalized to the Gaussian (or Normal) distribution. Coordinates are [Value i; z-value at Value i].

Use: Visual test for Normal distribution: data should fit a line.

Special application: Reference intervals.
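The coordinates of the three plots can be derived from the ordered data, as in the following sketch. Two points are assumptions rather than statements from the text: the cumulative probability is taken as rank/n ("up to & including Value i"), and the z-value uses the plotting position (rank - 0.5)/n, because rank/n would give a probability of 1 (an infinite z) for the largest value. Tied values are not treated specially.

```python
from statistics import NormalDist


def probability_plot_coordinates(values):
    """Return rows of (value, cumulative probability, folded probability, z-value)."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    rows = []
    for rank, value in enumerate(ordered, start=1):
        cumulative = rank / n                      # relative cumulative probability
        folded = cumulative if cumulative <= 0.5 else 1.0 - cumulative  # Krouwer plot
        z = std_normal.inv_cdf((rank - 0.5) / n)   # assumed plotting position
        rows.append((value, cumulative, folded, z))
    return rows


if __name__ == "__main__":
    # Hypothetical data
    for row in probability_plot_coordinates([102, 95, 110, 88, 99, 104]):
        print("value={:g}  P={:.3f}  folded={:.3f}  z={:+.3f}".format(*row))
```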

[Figure: relative cumulative probability plot; x-axis: Value (50 to 130), y-axis: Cumulated probability (0.0 to 1.0)]

[Figure: normal probability plot; x-axis: Value (50 to 130), y-axis: z-value (-3 to 3)]

[Figure: Krouwer plot; x-axis: Multiple of sigma (-5 to 5), y-axis: Cumulative frequency (0 to 0.5)]

Graphics with EXCEL®

Dot-plots

Construct the figure below with EXCEL
Follow the instructions on the sheet. Adapting the layout requires general knowledge about "Charts", which is not explained further here. Some guidance on "Charts" is given in the Annex of the book.

Note
The Worksheet "Dot Plot 2" contains a more advanced version of the Dot-plot. Its construction, however, is relatively complicated and requires a deeper understanding of EXCEL.

Data&DataPresentation (Worksheet "Dot Plot 1")

[Figure: dot plot of Sample A; y-axis: Value (0 to 140)]

Histogram with EXCEL®

Construct a histogram with the weighing data by
• Tools>Data Analysis>Histogram
• Follow the guidance in the "Print Screen"

The unmodified EXCEL® figure looks like the one below.

Disadvantages
• Layout not attractive (but can be modified)
• "Strange" data classification ("Bins")
• The "More"-bin
• "Static" = does not adapt when data change

We can modify the histogram by use of the general EXCEL-commands (see Annex).

Difficulty with histograms
No general rule can be given for the definition of the bin-width.

Data&DataPresentation (Worksheet "Histogram")

Frequency polygon with EXCEL®

Construct the frequency polygon from the histogram
• Copy the histogram into the FrequPolygon sheet
• [Left] Click on the histogram
• Click the chart wizard
• Choose this figure type
• Go to series
• Finalize the figure
• [when necessary, delete series 1]

Dynamic histogram

The dynamic histogram is an elegant form of presenting your data. It adapts automatically when data are added or changed.

It uses the "array formula" Frequency in EXCEL®

You have to
• Define "Bins"
• Select all cells of the "Frequency Range"
• Type: =FREQUENCY(Data-cell1:Data-celln; Bin-cell1:Bin-celln)
  – Note: whether ";" or "," separates the arguments depends on the "List Separator" (Control Panel>Regional Settings>Number)
• Press "SHIFT" and "CONTROL", hold them, and press "ENTER"
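The counting rule behind the dynamic histogram can be sketched outside EXCEL as well. The Python sketch below mimics what the FREQUENCY array formula returns: each bin counts the values above the previous bin limit up to and including its own limit, and one extra count collects everything above the last bin. The function name `frequency` and the few weighing values are illustrative assumptions, not part of the tutorial files.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin in the style of EXCEL's FREQUENCY.

    bins must be sorted ascending. Bin i holds values that are
    > bins[i-1] and <= bins[i]; the final extra element holds all
    values above the last bin limit.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        # bisect_left returns the index of the first bin limit >= value
        counts[bisect_left(bins, value)] += 1
    return counts


# Illustrative weighing results (mg) and bin upper limits (assumed values)
weights = [99.92, 100.23, 99.50, 99.83, 100.05, 100.22]
bin_limits = [99.50, 99.75, 100.00, 100.25]

counts = frequency(weights, bin_limits)
print(counts)
```

Because the counts are recomputed from the data each time, the result adapts when data are added or changed, which is the property the dynamic histogram relies on.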

Data, data presentation & data description

[Figure: frequency polygon / dynamic histogram; x-axis: Bin (99,00 to 101,00 in steps of 0,25); y-axis: Frequency (0 to 10)]

Data&DataPresentation (Worksheet "FrequPolygon")

Data&DataPresentation (Worksheet "DynHistogram")


Box and whisker plot with EXCEL®

Construct the box plot according to the instructions in the worksheet (from: http://www.mis.coventry.ac.uk/~nhunt/boxplot.htm).
The construction uses the EXCEL functions for the median, the quartiles, the minimum, and the maximum. The box plot can be constructed by putting them in the presented order. The figure must be finalized with some special "Figure-commands" (see worksheet explanations & "Print-Screen" at the right).
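The five numbers that the box plot is built from can also be computed in a few lines. The sketch below is a minimal illustration in Python; the function name `five_number_summary` is hypothetical, and the assumption that the "inclusive" quantile method interpolates like EXCEL's QUARTILE function is mine, not a statement from the worksheet.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # method="inclusive" interpolates between the sorted data points and
    # treats the minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Illustrative data (assumed values)
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

The box spans the lower to the upper quartile, the line inside the box marks the median, and the whiskers run to the minimum and maximum.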

Summary: graphs for univariate data
• The dot plot is a robust graph for small and large data sets.
• The histogram is more suitable for larger data sets; however, the bin-width must be chosen adequately.
• The box and whisker plot is particularly useful for larger data sets; however, it already contains some calculated statistics: it is not a purely graphical method.

Graphs are important tools for the investigation of data distribution (outliers, sort of distribution).

Data, data presentation & data description

Data&DataPresentation (Worksheet "Box Plot")


Run-sequence plot
The run-sequence plot presents data (Yi) along one axis in the time sequence in which they were obtained. Coordinates are [Time or event#; Yi].
Use: Presentation and investigation of time series (drift, shift, outlier).
Special application: quality control.
The figures below show 3 situations where randomness is violated (remember: during sorting, keep the sample number & the result together).

Lag plot (a lag is a fixed time displacement)
A lag plot checks whether a data set or time series is random or not. Random data should not exhibit any identifiable structure in the lag plot. Non-random structure in the lag plot indicates that the underlying data are not random. In the lag plot, Y(i-n) (usually n = 1) is plotted on the x-axis and Y(i) is plotted on the y-axis.
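The coordinates of a lag plot are easy to generate by hand. The Python sketch below builds the [Y(i-1); Y(i)] pairs and, as an additional numerical hint that is not part of the tutorial, computes the correlation between them; a value far from zero points to non-random structure. The function names and the sinusoidal test series are assumptions chosen for illustration.

```python
import math


def lag_pairs(series, lag=1):
    """Return the (Y[i-lag], Y[i]) coordinates of a lag plot."""
    return list(zip(series[:-lag], series[lag:]))


def lag_correlation(series, lag=1):
    """Pearson correlation between Y[i-lag] and Y[i]."""
    pairs = lag_pairs(series, lag)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Sinusoidal data sequence (assumed: 5 periods of 20 points each)
sinusoid = [math.sin(2 * math.pi * i / 20) for i in range(100)]
print(lag_pairs([1, 2, 3, 4]))
print(lag_correlation(sinusoid))
```

For the sinusoidal sequence the pairs trace out an ellipse in the lag plot, and the lag-1 correlation is correspondingly high.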

Time-indexed plots

Data, data presentation & data description

[Figure: underlying data structure and lag plot of a sinusoidal data sequence; x-axis: Yi-1, y-axis: Yi, both from -2 to 2]


Exploratory data analysis

A wealth of information about Exploratory Data Analysis can be found in the NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/

The most basic set of graphics for the investigation of a data set is the so-called "4-plot".

"4-plot"
The "4-plot" consists of a
• run sequence plot;
• lag plot;
• histogram;
• normal probability plot (see later >Normal distribution).

Investigate data for
• Location
• Variation
• Distribution
• Outliers

Data, data presentation & data description


Data description

Descriptive statistics
• Location
• Dispersion

Equations

Descriptive statistics with EXCEL®

Importance of digits

Introduction
After we have plotted our data, we need to characterize them quantitatively. For that purpose, we use several different measures that are related to the location (or central tendency) and the dispersion (or variability) of the data.

Note: In the following, we have to distinguish between parameters and their statistical estimates (or “statistics”).

Parameter
A parameter is a numerical quantity measuring some aspect of a population. For example, the mean is a measure of central tendency.
Greek letters are used to designate parameters. Parameters are rarely known and are usually estimated by statistics computed in samples. To the right of each Greek symbol is the symbol for the associated statistic used to estimate it from a sample.

Quantity             Parameter   Statistic
Mean                 μ           M (or Xbar)
Standard deviation   σ           s
Proportion           π           p
Correlation          ρ           r

Data, data presentation & data description

GaussianDistribution


Descriptive statistics

Location
Measures for the location (or central tendency) of data are the mean (average), the median, and the mode.

Mean
• Sum of all values divided by the number of data

Median
• Odd number of data: value in the center
• Even number of data: mean of the 2 values in the center

Mode
• Value that is observed most frequently

For symmetric distributions, the mean, median, and mode are found at the same value.
For skewed distributions, those 3 are found at different values.

Notes
In symmetric distributions, the mean is a good location measure. In skewed distributions, the median is a better location measure than the mean. The mode is the only location measure that can be used with nominal data.

Dispersion
Measures for the dispersion (or variability) of data are the range, quartiles/quantiles, the variance, the standard deviation, the coefficient of variation, and the z-value.

Range
• Maximum minus minimum

Quartiles
• The lower and upper quartiles (or 0.25 and 0.75 quantiles) are the 25th and 75th percentiles of the distribution. The 25th percentile of a variable is a value such that 25% of the values of the variable fall below that value.

Variance
• Sum of the squared differences of the values from the mean, divided by the number of data minus 1!

Standard deviation (SD, or s: both are used in the course)
• Square root of the variance (see also: from duplicates)

Coefficient of variation
• = relative SD in %; = 100 • [SD/mean] (%)

z-value (normalized, or normal, standard deviate)
• $z = (y - \mu)/\sigma$, or $z = (x_i - \text{mean})/s$
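The location and dispersion measures listed above can be checked with a short calculation. The following Python sketch is only an illustration with made-up numbers; the helper names `describe` and `z_value` are hypothetical and not part of the tutorial files.

```python
import statistics


def describe(values):
    """Location and dispersion measures as defined in the text."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # divides by (number of data - 1)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # also (n - 1) divisor
        "sd": sd,
        "cv_percent": 100 * sd / mean,
    }


def z_value(x, mean, sd):
    """Distance of x from the mean in units of the standard deviation."""
    return (x - mean) / sd


# Illustrative data (assumed values)
result = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in result.items():
    print(name, value)
```

Note that the variance and the SD use the divisor n - 1, in line with the definition given for samples.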

Data, data presentation & data description


Descriptive statistics

Equations

Data, data presentation & data description

From k duplicates
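The equations of this slide are not reproduced in the text. As a hedged reconstruction, the standard textbook forms of the measures defined on the previous slide are listed here. The symbols $d_i$ (difference between the two results of duplicate $i$) and $k$ (number of duplicate pairs) are notation assumed for this sketch; the original slide may have used different symbols.

```latex
\begin{align*}
\text{Mean:} \quad & \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
\text{Variance:} \quad & s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \\
\text{Standard deviation:} \quad & s = \sqrt{s^2} \\
\text{Coefficient of variation:} \quad & CV = 100 \cdot \frac{s}{\bar{x}} \ (\%) \\
\text{SD from } k \text{ duplicates:} \quad & s = \sqrt{\frac{\sum_{i=1}^{k} d_i^2}{2k}}
\end{align*}
```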


Descriptive statistics with EXCEL®

Tools>Data Analysis>Descriptive Statistics
• Follow the guidance in the "Print Screen"

The descriptive statistics function in EXCEL calculates all the measures we have seen (and more: those will be explained later).

Alternatively, the measures can be calculated individually by EXCEL using the fx icon.

Importance of digits

Coming back to the rounded weighing data, we look at the mean and the SD of the original data set and of the rounded data set:

We observe that the rounded data give different mean and SD values!
Too few digits give erroneous statistical estimates!
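The effect of rounding is easy to reproduce. The sketch below uses a handful of made-up weighing values (they are not the tutorial's data set) and compares the mean and SD before and after rounding to whole milligrams.

```python
import statistics

# Hypothetical weighing results in mg (assumed values for illustration)
weights = [99.25, 99.61, 99.83, 99.92, 100.05, 100.22, 100.23, 100.44]

# Same results reported with too few digits (rounded to whole mg)
rounded = [float(round(w)) for w in weights]

mean_original = statistics.mean(weights)
mean_rounded = statistics.mean(rounded)
sd_original = statistics.stdev(weights)
sd_rounded = statistics.stdev(rounded)

print("Mean:", mean_original, "versus rounded:", mean_rounded)
print("SD:  ", sd_original, "versus rounded:", sd_rounded)
```

After rounding, almost all results collapse onto the same value, so both estimates shift; the same mechanism explains the differences between the two columns of the weighing table.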

Weight (mg)               Descriptive statistics   Single formula
Mean                      100,0276                 100,0276
Standard Error            0,070017                 0,070017
Median                    100,13                   100,13
Mode                      100,23                   100,23
Standard Deviation        0,320857                 0,320857
Sample Variance           0,102949                 0,102949
Kurtosis                  0,433566                 0,433566
Skewness                  -1,09862                 -1,09862
Range                     1,19                     1,19
Minimum                   99,25                    99,25
Maximum                   100,44                   100,44
Sum                       2100,58                  2100,58
Count                     21                       21
Confidence Level(95,0%)   0,146052                 0,146052

Data, data presentation & data description

Weighing   Weight (mg)   Weight (mg), rounded
1          99,92         100
2          100,23        100
3          99,50         100
…          …             …
19         99,83         100
20         100,05        100
21         100,22        100
Mean       100,03        99,95
SD         0,321         0,218

GaussianDistribution (Worksheet "DescrStats")


More data

Compare the distributions when you acquire more data:

Which distribution do you recognize?

…………………………………………………………………

Data, data presentation & data description


Gaussian (or Normal) distribution

• "Bell-shaped" (similar to a histogram)

• Cumulated: "S-shaped"

• Cumulated & linearized

• 2-sided and 1-sided probabilities

• Inside/outside probabilities

• Probabilities at selected s (z) values

• Deviation from normality (skewness & kurtosis)

Gaussian distribution

Datasets; GaussianDistribution; NormalRankitPlot


The Gaussian distribution is of utmost importance in analytical chemistry.
The statistics involved with that distribution are called "parametric" statistics.

If we repeated the weighing ∞ times, we would expect a distribution as represented by the line.

The normal distribution is defined by its mean and standard deviation.
The standard normal distribution has a mean of 0 and a standard deviation of 1.

IMPORTANT NOTE: For the "infinite" distributions, specific symbols (= "Parameters") are used:
• Mean = µ
• Standard deviation = σ
• Normalized (or normal) standard deviate = z

EXCEL has a function that can simulate Gaussian distributed data. The function can be accessed with TOOLS>Data Analysis>Random number generation.

The worksheet "Random"
- explains how to generate random numbers
- presents the result in a dynamic histogram
- allows the comparison between the "requested" mean and SD and the simulated "sample mean and SD". Please note that those may be different, in particular when the sample size is low.
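The same experiment can be sketched outside EXCEL. The Python example below draws Gaussian distributed random numbers and compares the "requested" mean and SD with the sample estimates; the function name `simulate` and the chosen numbers are assumptions for illustration only.

```python
import random
import statistics


def simulate(mean, sd, n, seed=None):
    """Draw n Gaussian random numbers; return the sample mean and SD."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    sample = [rng.gauss(mean, sd) for _ in range(n)]
    return statistics.mean(sample), statistics.stdev(sample)


# Requested population: mean 100, SD 0.3 (assumed values)
for n in (5, 20, 10000):
    sample_mean, sample_sd = simulate(100.0, 0.3, n, seed=1)
    print(n, round(sample_mean, 3), round(sample_sd, 3))
```

With small n the sample mean and SD can deviate noticeably from the requested values; with large n they approach them.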

The Gaussian (or "Normal") distribution

Gaussian distribution

GaussianDistribution (Worksheet "Random")


The Gaussian distribution can be presented

• In the normal way: "Bell-shaped" (similar to a histogram)

• Cumulated: "S-shaped"

• Cumulated & linearized = Normal probability plot

EXCEL® template from P Hyltoft Petersen (note: not available in EXCEL® itself)

These worksheets use the EXCEL NORMDIST function.
The "Print Screens" guide you through their application.
The graphs will appear automatically.

Note: The Normal Probability Plot will be demonstrated later.

Graphical presentation of the Gaussian distribution

Gaussian distribution

GaussianDistribution (Worksheets "GaussBell"; "GaussCumul")


Gaussian distributions – Probabilities

IMPORTANT NOTE
When data are Gaussian distributed, we can predict the frequencies (or probabilities) of their occurrence within or outside certain distances (σ, or z-values) from the mean (see also Figures above).

These probabilities are used in parametric statistical calculations. They are listed in tables, but they also can be calculated with EXCEL®. Of particular importance are probabilities that are used in statistical tests (95%, 99% probabilities).

2-sided and 1-sided probabilities
Statistics distinguish between 2-sided and 1-sided probabilities:
• 2-sided probabilities: the question is whether A is different from B
• 1-sided probabilities: the question is whether A > B (or A < B)

Of practical importance are "inside" and "outside" probabilities:
• Outside probabilities, for example, are important in internal quality control.

Gaussian distribution


Gaussian distributions – Probabilities

Probabilities at selected σ (z) values

         INSIDE               OUTSIDE
z        1-sided   2-sided    1-sided   2-sided
1.28     90%
1.65     95%       [90%]      5%        [10%]
1.96     97.5%     95%        2.5%      5%
2.0      97.7%     95.5%      2.3%      4.5%
2.33     99%       98%        1.0%      2.0%
2.58     99.5%     99%        0.5%      1.0%
3.0      99.87%    99.7%      0.13%     0.3%
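The entries of this table can be recalculated from the standard normal distribution. The sketch below uses Python's `statistics.NormalDist` as a stand-in for the EXCEL functions; the function name `probabilities` and the dictionary keys are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def probabilities(z):
    """Inside and outside probabilities (as fractions) at a given z-value."""
    inside_1sided = STANDARD_NORMAL.cdf(z)                    # below +z
    inside_2sided = inside_1sided - STANDARD_NORMAL.cdf(-z)   # between -z and +z
    return {
        "inside_1sided": inside_1sided,
        "inside_2sided": inside_2sided,
        "outside_1sided": 1 - inside_1sided,
        "outside_2sided": 1 - inside_2sided,
    }


for z in (1.28, 1.65, 1.96, 2.0, 2.33, 2.58, 3.0):
    row = probabilities(z)
    print(z, {name: round(100 * p, 2) for name, p in row.items()})

# The reverse question: which z-value encloses 95% (2-sided)?
print(STANDARD_NORMAL.inv_cdf(0.975))
```

The 2-sided outside probability is always twice the 1-sided one, because the Gaussian distribution is symmetric about its mean.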

1-sided probabilities
1-sided probabilities can be expected in the presence of considerable systematic error.

At SE/RE ≥ 0.75 the probabilities become practically 1-sided (see Figure).

Gaussian distribution

[Figure: z-multiplier (1.6 to 2.0) versus SE/RE (0.00 to 1.00); inset: frequency versus value for SE/RE = 0 and SE/RE = 1, with limits at -1.96 and 1.96]

Stöckl D, Thienpont LM. About the z-multiplier in total error calculations. Clin Chem Lab Med 2008;46:1648–9.


Gaussian distributions – Probabilities

Worksheets "Probability"
These worksheets demonstrate the modulation of the Gaussian distribution and the observation of probabilities:

Outside 3s
This template demonstrates how the probabilities outside the original 3s limits (original population SD = 1) change when the population mean and/or SD are modulated. Modulation is achieved by simply clicking on the "Spinners".

Outside 1.96s
This template demonstrates how the probabilities outside the original 1.96s limits (original population SD = 1) change when the population mean and/or SD are modulated. Modulation is achieved by simply clicking on the "Spinners".

1-sided probabilities
This template demonstrates the concept of 1-sided probabilities. The "Spinners" allow the movement of the z-value.
1-sided probabilities are displayed at the top of the figure.
Mean and SD are fixed in this example.

"Inside"-probabilities
This template shows the probabilities within certain distances (z-values) of the population mean.
The z-value can be modulated with the "Spinners".
Mean and SD are fixed in this example.

"Outside"-probabilities
This template shows the probabilities outside certain distances (z-values) of the population mean.
The z-value can be modulated with the "Spinners".
Mean and SD are fixed in this example.

Gaussian distribution

GaussianDistribution


Skewness and Kurtosis
The Gaussian distribution is characterized by specific frequencies of values within certain distances of the mean, and it is symmetric about its mean.

Distributions observed in practice may deviate from the ideal Gaussian distribution because of:

• Skewness (left skew; right skew)
  – Too many data on one side (left or right)
• Kurtosis (too many, or too few data in the center)
  – Platykurtic (too few data in the center)
  – Leptokurtic (too many data in the center)

These situations are shown in the figures below, together with the respective numbers calculated by EXCEL®.

Coefficient of skewness: $C_{skew} = [\sum (x_i - x_m)^3 / N] / SD^3$

Zero: symmetric distribution; Positive: skewed to the right; Negative: skewed to the left.

Coefficient of kurtosis: $C_{kurt} = [\sum (x_i - x_m)^4 / N] / SD^4 - 3$

Zero: Normal distribution; Positive: Peaked distribution; Negative: Flat distribution.
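The two coefficients can be computed directly from their definitions. The Python sketch below follows the formulas as written, under the assumption that SD is calculated with N in the denominator; EXCEL's own skewness and kurtosis functions apply small-sample corrections, so their numbers will generally differ somewhat from this sketch. The function name `moment_coefficients` is hypothetical.

```python
import statistics


def moment_coefficients(values):
    """Return (coefficient of skewness, coefficient of kurtosis)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: SD with N as divisor
    skew = (sum((x - mean) ** 3 for x in values) / n) / sd ** 3
    kurt = (sum((x - mean) ** 4 for x in values) / n) / sd ** 4 - 3
    return skew, kurt


# Illustrative data sets (assumed values)
print(moment_coefficients([1, 2, 3, 4, 5]))    # symmetric, flat
print(moment_coefficients([1, 1, 1, 2, 10]))   # long tail to the right
```

The symmetric set gives a skewness of zero and a negative kurtosis (flat distribution); the second set, with its long tail to the right, gives a positive skewness.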

Both together are used in significance tests for normality. Some of the most commonly used tests are listed on the next page.

Deviation from normality

Gaussian distribution


Testing normality

Statistical significance tests for deviation from normality
• Chi-square
• Kolmogorov-Smirnov
• Anderson-Darling
• D’Agostino-Pearson
The preferred tests are Anderson-Darling and D’Agostino-Pearson!
Unfortunately, EXCEL has no built-in test for normality.
Statistical tests for normality are usually not useful below sample sizes of 20 to 30.

Graphical test for deviation from normality
Normal Probability Plot (Courtesy of Per Hyltoft Petersen)
The Normal Probability Plot/Rankit Plot allows visual assessment of the data distribution. When data are normally distributed, they should lie on a line. Deviations from the line indicate other distributions (e.g., skewed ones).

• A maximum of 1000 values can be entered. Please SORT the data, if necessary.
• The template provides for the transformation of the data into logarithms (the ln-version is chosen). The 1st cell (E6) already contains the formula. After sorting the original data, the 1st cell should be copied down to the last entry.
• The graphics are automatically produced on the other sheets.
• The plots have 2 y-axes. The left y-axis is in units of z; it is linear in terms of z. The right y-axis is in units of probability (%); it is non-linear in terms of probability!

The plot shows:
- The distribution of the data
- The Normal model of the data with its confidence intervals
- The -/+1.96 s limits of the data, corresponding to the 2.5th and 97.5th percentiles.

The cumulated percentage of the data can be read from the right y-scale.
Note: The % scale is represented by a picture and the tick-marks are created by separate data series. If necessary, adapt the location of the tick-marks by changing the value in the yellow cell (D3).

Assessment of normality
Compare the distribution with the model.

Testing normality

NormalRankitPlot


Testing normality

Triacylglyceride example

Testing normality

NormalRankitPlot


Calculations with logarithms

Data transformation: Logarithms
When the data are not normally distributed, one can try a transformation. Because, in nature, data are often log-normally distributed, logarithmic transformation of data can make them normally distributed.

Test for normality: Triglycerides (See: Datasets.xls)
n = 282; Lowest value: 0.3 mmol/L; Highest value: 3.2 mmol/L; Median: 0.92 mmol/L.

CBstat
• Anderson-Darling test: P < 0.01; data not normally distributed
• Anderson-Darling test after logarithmic (natural) transformation: P = 0.13; data log-normally distributed

Normal Probability Plot (ln-transformed data)
Data are "on a line": data are ln-normally distributed

Testing normality


Working with logarithms

Calculate the reference interval of a logarithmic distribution

Triglycerides

1. Transform the original data to ln
2. Calculate the mean of the ln(xi) values
3. Take the anti-ln of the mean of ln(xi)

This equals the geometric mean of the original population, which is close to its median.

The anti-ln of the mean of the logged values, $e^{-0.0689}$, is equal to the geometric mean of the original distribution, where the latter is given by $(x_1 \cdot x_2 \cdots x_n)^{1/n}$.

The anti-ln of the SD is meaningless.
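The identity between the anti-ln of the mean of the logarithms and the geometric mean can be verified numerically. The Python sketch below uses only the first six triglyceride values shown in the data table, so its result differs from the 0.933 obtained with all 282 values; the function name `antiln_of_mean_ln` is hypothetical.

```python
import math
import statistics


def antiln_of_mean_ln(values):
    """Steps 1 to 3: ln-transform, take the mean, take the anti-ln."""
    ln_values = [math.log(x) for x in values]
    return math.exp(statistics.fmean(ln_values))


# First six values (mmol/l) of the triglyceride data table
values = [0.3, 0.32, 0.34, 0.38, 0.4, 0.4]

via_logarithms = antiln_of_mean_ln(values)
via_product = math.prod(values) ** (1 / len(values))
via_library = statistics.geometric_mean(values)

print(via_logarithms, via_product, via_library)
```

All three routes give the same number up to floating-point rounding, which is why the logarithmic route and a built-in geometric mean function can be used interchangeably.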

Calculation of the 2.5th and 97.5th percentiles
Mean (ln-transformed)                     -0.0689
SD (ln-transformed)                       0.395
2.5th percentile                          -0.0689 - 1.96*0.395 = -0.843
97.5th percentile                         -0.0689 + 1.96*0.395 = 0.7053
Anti-ln of 2.5th & 97.5th percentiles     0.43 – 2.02
Reference interval = 0.43 – 2.02 mmol/l
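The calculation can be retraced in a few lines. The Python sketch below takes the mean and SD of the ln-transformed triglyceride data from the text and back-transforms the -/+1.96 s limits; the function name `lognormal_reference_interval` is hypothetical.

```python
import math


def lognormal_reference_interval(mean_ln, sd_ln, z=1.96):
    """Back-transformed 2.5th and 97.5th percentiles of ln-normal data."""
    lower_ln = mean_ln - z * sd_ln
    upper_ln = mean_ln + z * sd_ln
    # The anti-ln is applied to the limits, never to the SD itself
    return math.exp(lower_ln), math.exp(upper_ln)


# Mean and SD of the ln-transformed triglyceride values (from the text)
lower, upper = lognormal_reference_interval(-0.0689, 0.395)
print(round(lower, 2), round(upper, 2))
```

The resulting interval is asymmetric around the geometric mean, as expected for a distribution that is skewed to the right on the original scale.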

Alternative
Alternatively, a non-parametric approach to the data could have been chosen.
Non-parametric reference intervals can be calculated with the CBstat software.

Calculations with logarithms

Number   mmol/l   ln
1        0.3      -1.204
2        0.32     -1.139
3        0.34     -1.079
4        0.38     -0.968
5        0.4      -0.916
6        0.4      -0.916
…        …        …
282      3.2      1.163

Median           0.92
Mean, ln         -0.069
Anti-ln (e^x)    0.933   (EXCEL: EXP(x))
Geometric mean   0.933   (EXCEL: GEOMEAN)


Notes

CAVE log-transformation

Introduction of non-linearity by data transformation in method comparison and commutability studies.
Stöckl D, Thienpont LM. Clin Chem Lab Med 2008;46:1784-5.

Notes

[Figure, four panels: routine method versus reference method. Original scale (AU, 0 to 350): y = 1.0994x - 0.3849 and y = 0.9995x + 14.65. Logarithmic scale (lnAU, 1 to 6): y = 1.0113x + 0.0339 and y = -0.0108x^3 + 0.21x^2 - 0.376x + 3.0751.]


Sampling statistics & Confidence intervals

t-distribution (sampling distribution of means)
• Confidence interval of a mean
• Confidence interval of the "1.96" s-point

Chi² (χ²)-distribution (sampling distribution of variances)
• Confidence interval of a standard deviation

Interpretation of confidence limits

Introduction

We have characterized the Normal distribution on the basis of infinite sample size. In practice, we are only able to take a finite sample size. The smaller our sample size, the more uncertain our estimates will be.

All experimental estimates have an uncertainty. The "true" value lies within a certain confidence interval around our estimate!

We investigate the sampling distribution of
• Means: t-distribution
• Variances: χ²-distribution

These distributions are the basis for the calculation of confidence intervals (CI's) of experimentally determined means and variances (standard deviations).

Sampling statistics – Confidence intervals

SamplingStatistics; CI-Calculator


The t-distribution forms the basis for the statistical treatment of means. As the Normal distribution does for single results, the t-distribution allows us to predict (with a certain probability) the location of a true population mean (µ) within a certain distance (confidence interval, CI) of the experimental mean$. The probabilities can be viewed 1-sided and 2-sided.

The formula for the calculation of the CI is:

Note: The term s/√n is called the standard error of the mean (SEM).

$Note (infinite measurements or known σ):
If x is normally distributed with mean µ and standard deviation σ:
• 95% of x observations are within µ +/- 1.96 σ
• 95% of xm values are within µ +/- 1.96 σ/√n

When σ is known, one can use the z-value instead of the t-value.

Characteristics of the t-distribution (see also figure below)
• The shape of the t-distribution(s) depends on n.
• The t-distribution equals the normal distribution for n = ∞.
• t-distributions are more peaked than the normal and have wider tails.

Remark
The means of independent observations tend to be normally distributed irrespective of the primary type of distribution (Central limit theorem).
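The remark can be illustrated with a small simulation. The Python sketch below draws samples from a uniform (clearly non-Gaussian) distribution and looks at the spread of their means; the function name `sample_means`, the sample size of 25, and the number of repetitions are assumptions chosen for illustration.

```python
import random
import statistics


def sample_means(n, repetitions, seed=None):
    """Means of repeated samples of size n from a uniform(0, 1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(repetitions)
    ]


means = sample_means(n=25, repetitions=5000, seed=1)

# Population SD of uniform(0, 1) is sqrt(1/12); the SEM is SD / sqrt(n)
expected_sem = (1 / 12) ** 0.5 / 25 ** 0.5

print("SD of the simulated means:", statistics.stdev(means))
print("Expected SEM:             ", expected_sem)
```

Although the single observations are uniformly distributed, a histogram of the simulated means looks bell-shaped, and their SD is close to the standard error of the mean.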

t-distribution (sampling distribution of means)

CI of the mean: $\bar{x} \pm t_{(\alpha,\nu)} \cdot s/\sqrt{n}$

Sampling statistics – Confidence intervals

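[Figure: t-distribution for 4 degrees of freedom compared with the Gaussian distribution (infinite degrees of freedom)]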


Relationship confidence interval/confidence limit
The confidence interval (mean ± CI) spans from the lower to the higher confidence limit (CL): lower CL < mean < higher CL
• CI = $\bar{x} \pm t \cdot s/\sqrt{n}$
• Lower CL = $\bar{x} - t \cdot s/\sqrt{n}$
• Higher CL = $\bar{x} + t \cdot s/\sqrt{n}$

The CI/CL of the mean depends
• on the probability level, α
• on the sort of tail (1-/2-tailed, also called 1-sided or 2-sided)
• on n (ν, respectively)
  α, n, and the "sort of tail" determine the magnitude of t
• on the standard deviation s (also denoted SD in the book)

The expression t/√n can be summarized by a factor k. Then, a CL can be calculated as k • SD. A table of k-factors is given below, as well as a graphical presentation.

Relationship between confidence limit and sample size: k-factors for the 2-sided 95% confidence limit of a mean

Confidence limit of the 1.96 s point ("centile")
As for the mean, CLs can be calculated for any other point of the Normal distribution, e.g., for the 1.96 s point.
The standard error (SE) for the 1.96 s point is:

The CL of the 1.96 s point is calculated as: CL(1.96s) = 1.71 • CL(mean)
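The factor 1.71 follows from the standard error of the 1.96 s point. The short derivation below is a sketch that assumes the standard error combines the variance of the mean, $s^2/n$, with the variance of $1.96 \cdot s$, taken as $1.96^2 \cdot s^2/(2n)$:

```latex
\begin{align*}
SE_{1.96s} &= \sqrt{\frac{s^2}{n} + \frac{1.96^2 \, s^2}{2n}}
            = \frac{s}{\sqrt{n}} \sqrt{1 + \frac{1.96^2}{2}} \\
           &= \frac{s}{\sqrt{n}} \sqrt{2.9208}
            \approx 1.71 \cdot \frac{s}{\sqrt{n}}
\end{align*}
```

Since $s/\sqrt{n}$ is the standard error of the mean, the confidence limit of the 1.96 s point is about 1.71 times that of the mean.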

The CL of the 1.96 s point is important for
• Reference intervals
• The Bland & Altman interpretation of method comparison studies.

Confidence interval/limits of the mean

Sampling statistics – Confidence intervals

SE of the 1.96 s point: $SE_{1.96s} = \sqrt{\dfrac{s^2}{n} + \dfrac{1.96^2 \cdot s^2}{2n}}$

n      k (x SD)
4      1,591
5      1,242
6      1,049
10     0,715
15     0,554
20     0,468
21     0,455
30     0,373
50     0,284
100    0,198

[Figure: 2-sided 95% CL (SD units, 0,0 to 1,6) versus n (from n = 4, 0 to 100)]
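The k-factors make the confidence limits of a mean a one-line calculation. The Python sketch below uses the tabulated factors and the weighing example from the descriptive statistics output (mean 100,0276 mg, SD 0,320857 mg, n = 21); the names `K_FACTOR_95` and `confidence_limits_of_mean` are hypothetical, and only the tabulated sample sizes are supported.

```python
import math

# k = t / sqrt(n) for the 2-sided 95% confidence limit (values from the table)
K_FACTOR_95 = {
    4: 1.591, 5: 1.242, 6: 1.049, 10: 0.715, 15: 0.554,
    20: 0.468, 21: 0.455, 30: 0.373, 50: 0.284, 100: 0.198,
}


def confidence_limits_of_mean(mean, sd, n):
    """Lower and higher 2-sided 95% confidence limit of a mean."""
    k = K_FACTOR_95[n]  # raises KeyError for sample sizes not in the table
    return mean - k * sd, mean + k * sd


lower, upper = confidence_limits_of_mean(100.0276, 0.320857, 21)
print(lower, upper)

# Cross-check of one table entry: t(2-sided 95%, 20 degrees of freedom) = 2.086
print(round(2.086 / math.sqrt(21), 3))
```

The half-width of this interval agrees, within the rounding of k, with the "Confidence Level(95,0%)" value of 0,146052 reported by the EXCEL descriptive statistics tool.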


The χ²-distribution allows us to predict the location of a true population standard deviation (σ) within a certain distance (CI) of the experimental SD. The probabilities can be viewed 1-sided and 2-sided. The distribution is used to calculate CIs/CLs of SDs.

Characteristics of the Chi² (χ²)-distribution (see also figure below)

Confidence interval/limits of s (SD)
The CI/CL of s (SD) depends
• on the probability level, α (1-sided or 2-sided, also called 1-/2-tailed)
• on n (ν, respectively)

Calculation (2-sided; α/2 = 0.025, [1 - α/2] = 0.975)
Lower CL = $SD \cdot [(n-1)/\chi^2_{0.025}(n-1)]^{0.5}$
Upper CL = $SD \cdot [(n-1)/\chi^2_{0.975}(n-1)]^{0.5}$
(The subscript is the upper-tail probability, so $\chi^2_{0.025}$ is the larger critical value; this is consistent with the factors tabulated below.)

Relationship between confidence limit and sample size: factors for the 2-sided 95% confidence limit of s (SD)

Chi² (χ²)-distribution (sampling distribution of variances)

Sampling statistics – Confidence intervals

• The shape of the function(s) depends on n (ν)
• The function(s) are highly asymmetric
  As a consequence, 95%-CIs of SDs become asymmetric

       Limits (x SD)
n      Lower    Upper
4      0,566    3,729
5      0,599    2,874
6      0,624    2,453
10     0,688    1,826
15     0,732    1,577
20     0,760    1,461
21     0,765    1,444
30     0,796    1,344
50     0,835    1,246
100    0,878    1,162

[Figure: upper and lower 2-sided 95% CL of the SD (SD units, 0,0 to 4,0) versus n (from n = 4, 0 to 100)]
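With the tabulated factors, the confidence limits of an SD are obtained by simple multiplication. The Python sketch below applies them to the SD of the weighing example (0,320857 mg, n = 21); the names `SD_FACTORS_95` and `confidence_limits_of_sd` are hypothetical, and only the tabulated sample sizes are supported.

```python
# (lower, upper) factors for the 2-sided 95% confidence limits of an SD,
# taken from the table above
SD_FACTORS_95 = {
    4: (0.566, 3.729), 5: (0.599, 2.874), 6: (0.624, 2.453),
    10: (0.688, 1.826), 15: (0.732, 1.577), 20: (0.760, 1.461),
    21: (0.765, 1.444), 30: (0.796, 1.344), 50: (0.835, 1.246),
    100: (0.878, 1.162),
}


def confidence_limits_of_sd(sd, n):
    """Lower and upper 2-sided 95% confidence limit of an experimental SD."""
    lower_factor, upper_factor = SD_FACTORS_95[n]  # KeyError if n not tabulated
    return sd * lower_factor, sd * upper_factor


sd = 0.320857
lower, upper = confidence_limits_of_sd(sd, 21)
print(lower, upper)
print("Distance below the SD:", sd - lower, "above the SD:", upper - sd)
```

The upper limit lies further from the SD than the lower limit, which reflects the asymmetry of the χ²-distribution.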


Interpretation of 95%-confidence limits

Confidence limits and quality specifications

The figure below shows a graphical interpretation of 95%-confidence limits versus a predefined quality specification: "10".

Note: When comparing an estimate with a specification, the confidence limits are usually constructed 1-sided.

1. Interpretation of the cases A – D when the specification is a limit
A: "In", the specification is satisfied with 95% probability.
B: Not "In" with 95% probability
• More data may help
C: Not "In" with 95% probability, but also not "Out" with 95% probability.
D: "Out"

2. Interpretation when the number characterizes a stable process
If the "number" is the typical performance of a stable process, situation C can still be accepted.
C: Look at the lower limit: not "Out" with 95% probability. This situation is applied in the EP 5 protocol to investigate whether the user CV is different from the typical manufacturer CV.

Sampling statistics – Confidence intervals

[Figure: cases A – D versus the specification "10", interpreted as 1. Limit; 2. Typical performance]


This tutorial contains interactive exercises that demonstrate:

Sampling
- The general effect of sample size on the distribution of mean & SD (single repetitions).

Var, SD, Mean
- The expected distribution of variance, SD, and mean for different sample sizes (high number of repetitions).

Central Limit
- The "Central Limit Theorem".

t-Distr
- The effect of the degrees of freedom on the t-distribution.

Chi-square
- The effect of the degrees of freedom on the Chi-square-distribution.

The worksheets
• Conf-Interval
• CI 1.96 centile
• CI interpretation
contain information similar to this text.

This file allows:
- Calculation of 1- and 2-sided 95% confidence intervals for mean and coefficient of variation (CV).
- Comparison of the experimentally observed mean and CV with a target value.

Exercises

SamplingStatistics

CI-Calculator
