module six: outlier detection for two sample case

73
1 Module Six: Outlier Detection for Two Sample Case Two sample plot, also known as Youden’s Plot, is a scatter plot with a confidence region. Youden used it for detecting labs with unusual testing results when two samples are tested in n different lab. Youden plot is a special case of the bivariate control chart, and the idea behind is the Principal Component Analysis. In this module, we will discuss Principal Component Analysis and how

Upload: junior

Post on 13-Jan-2016

16 views

Category:

Documents


0 download

DESCRIPTION

Module Six: Outlier Detection for Two Sample Case. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Module Six: Outlier Detection for Two Sample Case

1

Module Six: Outlier Detection for Two Sample Case

Two sample plot, also known as Youden’s Plot, is a scatter plot with a confidence region. Youden used it for detecting labs with unusual testing results when two samples are tested in n different lab. Youden plot is a special case of the bivariate control chart, and the idea behind is the Principal Component Analysis.

In this module, we will discuss Principal Component Analysis and how it is applied to construct bivariate control charts and discuss the interpretations of the plot.

Page 2: Module Six: Outlier Detection for Two Sample Case

2

There are three types of laboratory testing, where two-sample plots can be applied:

• A group of labs participate in testing two similar materials using the same method. It is important to identify labs, if any, which perform extremely different from the rest in either one or both materials. A two-sample plot is used to detect the extreme labs.

This is the case studied by Youden (1954), and later extended by Mandel and Lashof (1974).

Page 3: Module Six: Outlier Detection for Two Sample Case

3

2. A participated lab tests a variety of material using the same method as the standard lab, and the testing results are compared to the standardized lab to study if any particular test is extremely different from the standard lab. A paired bivariate control chart is used to determine how good the participated lab is when compared with the standard lab. Tracy, Young and Mason (1995) used a similar approach, bivariate control chart for studying paired measurements in quality control.

3. A lab test similar two or more materials using a standard procedure on a regular bases for many days. A bivariate or multivariate control chart is used for process control. In recent decades, multivariate control charts have been developed for process control of two or more quality characteristics simultaneously. Similar situation may occur in laboratory testing process control.

Page 4: Module Six: Outlier Detection for Two Sample Case

4

The classical Youden’s Plot for Two-sample cases

The following inter-laboratory study about the percent of insoluble residue in cement reported by 29 Laboratories Row %residueA %residueB

1 0.31 0.22

2 0.08 0.12

3 0.24 0.14

4 0.14 0.07

5* 0.52 0.37

6 0.38 0.19

7 0.22 0.14

8* 0.46 0.23

9 0.26 0.05

10 0.28 0.14

11 0.10 0.18

12 0.20 0.09

13 0.26 0.10

14 0.28 0.14

15 0.25 0.13

Row %residueA %residueB

16 0.25 0.11

17 0.26 0.17

18 0.26 0.18

19 0.12 0.05

20 0.29 0.14

21 0.22 0.11

22 0.13 0.10

23* 0.56 0.42

24 0.30 0.30

25 0.24 0.06

26* 0.25 0.35

27 0.24 0.09

28 0.28 0.23

29 0.14 0.10

Page 5: Module Six: Outlier Detection for Two Sample Case

5

Using One sample Box-Plot, we have

%residueA %residueB

0.0

0.1

0.2

0.3

0.4

0.5

0.6

%In

solu

abl

e r

esi

due

0.08

0.46

0.520.56

0.25

0.37

0.42

0.14

Side-By-Side Box Plot for Material A abd B -Insoluable Residue in Cement

Quick check suggests Lab #5, 8, and 23 are likely outliers. They are excluded in the two-sample plot.

Page 6: Module Six: Outlier Detection for Two Sample Case

6

(from NIST website)

The vertical line passes through the median of the X-variable.

The Horizontal line passed through the median of the Y-variable.

The center of the graph is the intersection of the two median lines. This intersection is called the Manhattan Median.

Page 7: Module Six: Outlier Detection for Two Sample Case

7

What is the 45o line for? What is the practical meaning of this line?

Why and How Youden Plot works?

Under the absolute perfect condition, the pair of data should be all equal when testing the same material twice using the same testing procedure. However, it is never the case. In fact, there are two major components of uncertainties: The systematic error and random error.

From common experience, systematic error of a given lab should be the same or very close, in theory, when performing the test under the same condition using the same procedure. And that each lab is different from others. This suggests that, if there is no random or unexpected error, under the perfect condition, but allowing lab differences, then, the pair of data should be equal (that x=y) for a given lab, but, may be different for different labs. Translating this situation, the pairs of data are located at the 45o line passing through the Manhattan Median.

Therefore, the distance from (x,y) pairs on the 45o line is the component of the systematic error for the given lab.

How can we identify the random error component from the plot?

In reality, a pair of data points are scattered on the graph. Only very few pairs will be located on the 45o line, and even so, x and y are really the same.

Consider a pair of data point which is away from center and are not on the 45o line. The distance from the point (x,y) to the center is the total error:

Total error = Systematic error + Random error

Page 8: Module Six: Outlier Detection for Two Sample Case

8

(-7, -2)

Random Error Component(RaE): the distance from (-2,-7) to (-4.5, -4.5)

Systematic Error (SyE) component: the distance from (-4.5, 4.5) to (0,0)

The total error = 2 2(RaE) ( )SyE

Page 9: Module Six: Outlier Detection for Two Sample Case

9

(x3 , y3)

How to determine (x3 , y3)?

For simplicity, assume center is (0,0). Then we see that x3 = (x2 + y2)/2. That is the coordinate for the systematic component is at ((x2 + y2)/2 , (x2 + y2)/2 ). Hence, systematic error component is the distance to from this point to (0,0):

If the center is at (x1 , y1), then the Systematic Error component is

22 2 2 22[( ) / 2] | | / 2x y x y

2 1 2 1(| ( ) ( ) |) / 2x x y y

The distance is allowed to be negative here for reflecting the quadrant. The magnitude is positive.

Page 10: Module Six: Outlier Detection for Two Sample Case

10

Page 11: Module Six: Outlier Detection for Two Sample Case

11

Page 12: Module Six: Outlier Detection for Two Sample Case

12

NOTE: p in the formula is the Random Component, in our notation, we use: RaE.

If there is no systematic error, the uncertainty should involve only random error component. And therefore, the standard deviation of the random error components is an estimate of the random error variation. 95% coverage region is given by the radius = 2.45(s). The value 2.45 is based on the assumption the (X,Y) follows Bivariate Normal and are independent.

Page 13: Module Six: Outlier Detection for Two Sample Case

13

Youden’s Calculation of the standard deviation and the radius

Youden plot is applied to identify labs with unusual high systematic error as well as labs have unusually high random errors.

•When a lab has both large total error, the data point will be far away from the center. They are the first group of labs that should be closely investigated.

•It may happen that a a material is less sensitive to different environment than the other. When this happens, the data points will tend to be parallel to X- axis or Y-axis, with a large variation due to a material. This can also be quickly identified.

•When a lab have an unusually high systematic error, but small random component, it will scattered along the 45o line. And that more data points are in upper right quadrant and lower left quadrant.

•When a lab has a large random component, it will be far from the 45o line, that is, will have higher probability to be in the upper-left and lower-right quadrants.

Page 14: Module Six: Outlier Detection for Two Sample Case

14

In order to identify these unusual labs, labs with large systematic error, Youden suggested to draw a circle centered at the Manhattan median with the radius being a multiple of the variation due to random error.

The variation due to the random errors is the standard deviation of the random errors of labs, which is obtained by:

According to Youden (1959),

a 95% coverage probability of a circle is given by the circle with radius = 2.448(s)

A 99% coverage probability of a circle is given the cirvle with radius = 3.035(s)

The relationship between the coverage probability and the multiple b is given in Youden’s original paper in the Journal of Industrial Quality Control, 1959, p. 24-28:

Coverage probability = 1-exp(-b2/2)

2( )

1iRaE

sn

Page 15: Module Six: Outlier Detection for Two Sample Case

15

An alternative approach to compute the systematic error components, random error components and the corresponding variations.

(-7, -2)

For a given (x,y) data point, its corresponding coordinate of the systematic component is ( (x+y)/2, (x+y)/2), and the difference between (x,y) and ((x+y)/2, (x+y)/2) along the X-axis is

X – (x+Y)/2 = (x-y)/2

The difference along the Y-axis = (y-x)/2

This suggests that the random error for the data point (x,y) is (x-y)/2.

For each lab,

1. compute the systematic component

[ (x+y)) – (x0+y0)]/2, where (x0 , y0) is the median origin.

2. Compute the random error component:

(x-y)/2

3. Compute variance and s.d. for each component.

sB measures the variation of between-lab systematic errors.

se measures the variation due to random errors.

4. To construct the circle with 95% coverage probability, the radius = 2.45(se)

Page 16: Module Six: Outlier Detection for Two Sample Case

16

Page 17: Module Six: Outlier Detection for Two Sample Case

17

Page 18: Module Six: Outlier Detection for Two Sample Case

18

Activity: To construct a classical Youden’s Plot for Two-sample cases

The following inter-laboratory study about the percent of insoluble residue in cement reported by 29 Laboratories

Row %residueA %residueB

1 0.31 0.22

2 0.08 0.12

3 0.24 0.14

4 0.14 0.07

5* 0.52 0.37

6 0.38 0.19

7 0.22 0.14

8* 0.46 0.23

9 0.26 0.05

10 0.28 0.14

11 0.10 0.18

12 0.20 0.09

13 0.26 0.10

14 0.28 0.14

15 0.25 0.13

Row %residueA %residueB

16 0.25 0.11

17 0.26 0.17

18 0.26 0.18

19 0.12 0.05

20 0.29 0.14

21 0.22 0.11

22 0.13 0.10

23* 0.56 0.42

24 0.30 0.30

25 0.24 0.06

26* 0.25 0.35

27 0.24 0.09

28 0.28 0.23

29 0.14 0.10

Page 19: Module Six: Outlier Detection for Two Sample Case

19

Variable N N* Mean StDev

(A-B)/2 25 4 0.04760 0.03473

(A+B)/2 25 4 0.1816 0.0570

The radius of the circle for the 95% coverage region is .03473 x 2.45 = .085

Hence the circle has the form:

(x-.25)2 + (y-.13)2 = (.85)2

Variable N Mean Median StDev

Material A 25 0.2292 0.2500 0.0731

Material B 25 0.1340 0.1300 0.0597

Hands-on activities using Mandel and Lashof’s data

Page 20: Module Six: Outlier Detection for Two Sample Case

20

Principal Component Analysis -The concept Behind Two-Sample Plots

The idea behind the two-sample plot is the principal components and bivariate normal distribution. The following scatter plot illustrate the principal components for bivariate case.

70 80 90 100

80

90

100

X1, Tensile Strength-H

X2,

Ten

sile

Str

engt

h-E

Y1, Major PrincipalComponent

Y2, Minor PrincipalComponent

Scatter plot with Principal Compoment Coordinates

(79.81, 92.81)

Page 21: Module Six: Outlier Detection for Two Sample Case

21

The scatter plot is from an inter-laboratory study in Mandel & Lashof (1974). The data are tensile strength of rubbers using two different materials, and testing in 16 laboratories. Laboratory Strength-E (X2) Strength-H (X1)

1 94 80

2 103 82

3 94 77

4 99 83

5 97 86

6 91 76

7 91 81

8 102 98

9 98 83

10 91 81

11 93 82

12 82 69

13 93 81

14 82 72

15 83 73

16 92 73

Page 22: Module Six: Outlier Detection for Two Sample Case

22

Y1 and Y2 are new coordinates.

•Y1 represents the direction where the data values have the largest uncertainty.

•Y2 is perpendicular to Y1.

•They intersect at the sample averages = (79.81, 92.81) .

To find Y1 and Y2, we need to make transformation from X1 and X2. To simplify the discussion, we move the origin to and redefine the (X1,X2) coordinate as

x1 = X1 - , x2 = X2 - , so that the origin is (0,0).

The relationship is illustrated in the following graph. We would like to present the data of a given lab, p = (x1,x2) in terms of p = (y1,y2). From basic geometry relations, we see:

y1 = (cos) x1 + (sin) x2

y2 = (-sin) x1 + (cos) x2

1 2( , )x x

y2

x1

x2

1 2( , )x x

1x 2x

y1 p

The angle is determined so that the observations along the Y1 axis has the largest variability. But HOW?

Page 23: Module Six: Outlier Detection for Two Sample Case

23

The transformation from (x1,x2) to (y1,y2) results several nice properties

1. The variability along y1 is largest.

2. Y1 and y2 are uncorrelated, that is, orthogonal.

3. The confidence region based on (y1,y2) is easy to construct, and provide useful interpretations of the two sample plots.

Questions remain unanswered are

1. How to determine the angle so that the variability of observations along the y1 axis is maximized?

2. How to construct the ellipse for confidence region with different levels of confidences?

3. How to interpret the two-sample plots?

Page 24: Module Six: Outlier Detection for Two Sample Case

24

How to determine the y1 and y2 axis so that the variability of observations along the y1 axis is maximized and y2 is orthogonal to y1?Rewrite the linear relation between (y1,y2) and (x1,x2) in matrix notation:

y1 = (cos) x1 + (sin) x2

y2 = (-sin) x1 + (cos) x2'

1 1 2 1 1

'2 1 2 2 2

(cos ) (sin ) cos sin

( sin ) (cos ) sin cos

y x x x A XY AX

y x x x A X

NOTE: X is bivariate , so is Y, and

V(X) = , V(Y) = A’V(X)A =

and are called the eigen values. Which are the solutions of

And, V(Y1) = V(Y2) = Correlation between Y1 and Y2 = 0.

1

2

0

0

1 1 2

1 2 2

( ) ( , )

( , ) ( )

V x Cov x x

Cov x x V x

( ) 0V X I

Page 25: Module Six: Outlier Detection for Two Sample Case

25

and are called the eigen values. Which are the solutions of

And, V(Y1) = V(Y2) = Correlation between Y1 and Y2 = 0.

The angle = if , when , = 45o

Note the angle depends on the correlation between X1 and X2 , as well as, on the variances of X1 and X2, respectively.

• When is close to zero, the angle is also close to zero. If V(X1) and V(X2) are close, then, the scatter plots are scattered like a circle. That is, there is no clear major principal component.

•When is close to zero and V(X1) is much larger than V(X2), then, the angle will be close to zero, and the data points are likely to be parallel to the X-axis. On the other hand, if V(X1) is much smaller than V(X2), the angle will be close to 900, and the data points will be more likely parallel to the Y-axis.

2

221

212arctan)5(.

21

21

211arctan

Page 26: Module Six: Outlier Detection for Two Sample Case

26

Consider, now, we actually observe the following two sample data:

11 21

12 22

13 23

1 2n n

x x

x x

x x

x x

The sample means are given by

The sample variance-covariance matrix is given by 2

1 1 2

21 2 2

s rs s

rs s s

r is the Pearson’s correlation coefficient, and S2 is the sample variance. S is the sample standard deviation.

1

2

x

x

V(Y) is the solution of

The solutions for are given by

ˆ( )V X

ˆ( ) 0V X I

2 2 2 2 2 2 2 21 2 1 2 1 2( ) ( ) 4(1 )

2

s s s s r s s

NOTE: V(Y1) + V(Y2) = s12 + s2

2 = V(X1) + V(X2)

Page 27: Module Six: Outlier Detection for Two Sample Case

27

2

221

212arctan)5(.

ss

srs

21 ss

Using the sample data, the angle is estimated by

21

211arctansrs

s

Case Example

We know use the Tensile strength data to demonstration the computation of principal components and related sample information.

For the Tensile Strength Example, X1 is the material H and X2 is the material E.

The number of labs, n= 16.

Using Minitab, we can obtain the following information:

Page 28: Module Six: Outlier Detection for Two Sample Case

28

Variance-Covariances Matrix:

H E

H 46.4292 35.0292

E 35.0292 40.9625

Correlations = .8031

Principal Component Analysis: Tensile Strength-H, Tensile Strength-E

Eigen values are: 78.831 and 8.560, the solutions of

Linear Coefficients between (Y1, Y2) and (X1,X2)

Variable Y1 Y2

H 0.734 -0.679

E 0.679 0.734

These are the coefficients for y1 = (cos) x1 + (sin) x2

y2 = (-sin) x1 + (cos) x2

The angle =

= arctan[(78.831-46.4292)/35.0292] = 42.770

21 1 2

21 2 2

s rs s

rs s s

ˆ( )V X

ˆ( ) 0V X I

21

211arctansrs

s

Page 29: Module Six: Outlier Detection for Two Sample Case

29

The sample means from the sample data are

Variable N Mean

H 16 79.81

E 16 92.81

In terms of (Y1, Y2), the means are

Variable N Mean

Y1 16 121.61

Y2 16 13.937

70 80 90 100

80

90

100

Tensile Strength-H

Te

nsile

Str

eng

th-E

Two-sample Scatter PLot for the Tensile Strength Data

Two sample scatter plot is

Page 30: Module Six: Outlier Detection for Two Sample Case

30

Confidence Region for two-sample Plots

Each of the X1 and X2 can be treated as a univariate variable. In most cases, we consider each variable follows a normal distribution. The rules we introduced for one variable case do assume that each variable follows a normal distribution. We can apply outlier detecting methods for each variable. When we consider two variables simultaneously, [X1, X2] are bivariate, and the distribution for [X1,X2] is taken to be bivariate normal distribution. Because of this extension, we are able to construct ellipses that works similar to empirical rule. We can construct several ellipses so that the probability of having the pair of data inside the ellipse is .95 or .99 and so on.

The construction of the ellipse can be simplified when we use the principal components as described above. And the interpretations based on the principal components are very useful.

Page 31: Module Six: Outlier Detection for Two Sample Case

31

Bivariate Normal Distribution and it’s application in two-sample plots

Because the ellipse region relies on bivariate normal distribution, we briefly give an introduction of the bivariate normal in the following.

The bivariate normal distribution of X1 and X2 has the form:

f(x1, x2) = (2exp(-Q/2)

Where ])())((2)(

[1

122

222

21

221121

211

2

XXXXQ

We usually use the notation :

),(~221

2121

2

1

2

1

BNX

X

2

1

is the mean vector.

221

2121

is the variance-covariance matrix.

Page 32: Module Six: Outlier Detection for Two Sample Case

32

A ellipse Q = c, c>0 centered at can then be created in the (X1, X2) coordinate. .

The shape and the orientation of the ellipse is determined by the values of

and , and its size is determined by the choice of the constant c. The choice

of the constant c can be determined based on the level of confidence using the bivariate normal distribution.

When we collect two samples, the sample data provide sample means and sample variance-covariance matrix.

The sample means are given by

The sample variance-covariance matrix is given by 21 1 2

21 2 2

s rs s

rs s s

r is the Pearson’s correlation coefficient, and S2 is the sample variance. S is the sample standard deviation.

1

2

x

x

ˆ( )V X

2

1

Page 33: Module Six: Outlier Detection for Two Sample Case

33

When replacing the population parameters by the corresponding sample information, we obtain the Hotelling’s T2:

2 22 1 1 1 1 2 2 2 2

2 2 21 1 2 2

1 ( ) 2 ( )( ) ( )[ ]

1

X x r X x X x X xT

r s s s s

T2 is distributed as

Which is a multiple of an F-distribution. The ellipse region is now given by

T2 = c*

The constant c* is determined using the F-distribution described above. The corresponding 100(1-)% percentile is from the F distribution with degrees of freedom (2, n-2).

For example, when participated labs, n = 16, then a 95% percentile of

F(2, n-2) = F(2,14) = 3.74. Therefore, c* = 2(255)/224 x (3.74) = 8.515

A 95% ellipse region can then be constructed using T2 = 8.515

)2,2(

2

)2(

)1(2

nF

nn

n

Page 34: Module Six: Outlier Detection for Two Sample Case

34

How to construct a 100(1-)% region in a two-sample plot – the Youden’s Plot?

Under the (X1,X2) coordinate, a general form of an ellipse is given by :

Under the principal component coordinate and move the center to (0,0), the ellipse has the simple form in terms of (y1, y2):

1)())((2)(

22

22

12

2211

11

211

a

xX

a

xXxX

a

xX

122

22

11

21

b

y

b

y

The distance from the center, (0,0) to the ellipse curve along the major principal component is is , and the distance from the center (0,0) to the curve along the minor principal component is

22b

11b

Page 35: Module Six: Outlier Detection for Two Sample Case

35

When matching the mathematical form with the statistical form of confidence ellipse region: T2 = c*, it is seen that, T2 = c* can be expressed in terms of the general form above.

Confidence ellipse region in terms of the principal components has a simple form:

1*

2

22

*1

21

c

y

c

y

*

1( ,0)c

*2(0, )c

y1

y2

*2(0, )c

*1( ,0)c

*1 2( ( ) ,0)c *

1 2( ( ) ,0)c

The sum of the distance from foci to curve is constant, and equal to

Hence, for any point on the ellipse (y1,y2), when y1 is given, y2 can be computed by

*12 c

* 222 2 1

1

y c y

Page 36: Module Six: Outlier Detection for Two Sample Case

36

The sum of the distance from foci to curve is constant, and equal to

Hence, for any point on the ellipse (y1,y2), when y1 is given, y2 can be computed by

And, the original scale in terms of (x1,x2) at the center (0,0) is given by

x1 = (cosy1 – (sin)y2

x2 = (sin)y1 + (con)y2

Shifting the center from (0,0) back to , we have

*12 c

* 222 2 1

1

y c y

1 2( , )x x

1 1 1 2

2 2 1 2

21 1 1 2

(cos ) (sin )

(sin ) (cos )

where = arctan[( ) /( )

X x y y

X x y y

s rs s

Page 37: Module Six: Outlier Detection for Two Sample Case

37

We are now ready to construct a 100(1-)% confidence ellipse region for the Tensile Strength data.

In terms of the principal component coordinate (y1, y2) at the center = (0,0).

A 95% coverage region has the region covered by the ellipse 1*

2

22

*1

21

c

y

c

y

Which is given by

2 21 2( ) ( )

178.831 8.515 8.56 8.515

Y Y

*1c *

2c= 25.908 = 8.537*

1 2( )c = 24.461

The vertices are (-25.908, 0) and (25.908, 0)

The foci are (-24.461, 0) and (24.461, 0)

The Vertical axis are intersected at (0, 8.537) and (0, -8.537)

The sum of the distances from two foci to the curve is 51.816

Page 38: Module Six: Outlier Detection for Two Sample Case

38

140130120110

20

15

10

First Component

Seco

nd C

om

pone

nt

Score Plot of Tensile-Tensile

140130120110

20

15

10

First Component

Se

cond

Co

mp

one

nt

Score Plot of Tensile-Tensile

Page 39: Module Six: Outlier Detection for Two Sample Case

39

To construct the ellipse curve, the point on the curve is given by:

* 2 221 2 1 2 1 1 1

1

( , ) ( , ) ( , 72.8884 .1086 )y y y c y y y

The original scale of the ellipse curve on the (X1, X2) is given by

1 1 1 2 1 2

2 2 1 2 1 2

(cos ) (sin ) 79.81 .734 .679

(sin ) (cos ) 92.81 .679 .734

X x y y y y

X x y y y y

Using the original scale of (X1, X2), the 95% coverage region is

2 21 1 2 2( 79.91) ( 79.81)( 92.81) ( 92.81)

1140.36 82.08 123.833

X X X X

Page 40: Module Six: Outlier Detection for Two Sample Case

40

The two-sample plot with the 95% coverage probability for the tensile strength data is given by:

Raw Data

95% Ellipse

60 70 80 90 100

80

90

100

110

Material H

Ma

teri

al E

Two-Sample PLot for the Tensile Strength Data- Materal H Vs Material E

We notice that one lab is outside of the 95% coverage region. This is Lab 8 with two-sample results (98,102)

Page 41: Module Six: Outlier Detection for Two Sample Case

41

Marginal Plots as an supplement for detecting outliers in two-sample case

The two-sample plot (Youden’s plot) takes the correlation between two-samples into account and are assumed to follow a bivariate normal distribution. This is a useful plot for detecting outliers. In addition to the two-sample plot, one have introduced box-plot for each sample. A two-dimensional marginal box-plot is a good addition for detecting outliers when we sue two-sample Youden’s plot. Using the Tensilt strength as an example, we use Minitab to construct the marginal box-plot in the following.

Page 42: Module Six: Outlier Detection for Two Sample Case

42

65 70 75 80 85 90 95 100

80

85

90

95

100

105

Tensile Strength-H

Te

nsile

Str

eng

th-E

Scatter Plot between Material H and Ewith Marginal Box PLots

The horizontal Box plot is the box-plot for Tensile Strength of Material H. The vertical Box plot id for Material E. It is noticed that there is a clear outlier when testing Material H based on the marginal box plot. Lab 8 is the outlier lab when testing Material H. A similar finding was obtained using Two-sample plot.

Page 43: Module Six: Outlier Detection for Two Sample Case

43

How to construct Two-sample Plot using Minitab?

Minitab does not have a procedure on the pull-down menu. However, a Minitab macro program, similar to what we call a subroutine, has been written for two-sample plots. I have added this macro into the Minitab Macro collection. The name of the macro is BCC (stands for Bivariate Control Chart). There are some restrictions when applying this macro.

Preparation of data:

1. Enter sample One in C1, which will be on the X-axis.

2. Enter sample Two in C2, which will be on the Y-axis.

3. Obtain the critical F-value of the F-distribution using any Statistics book or using Minitab: F(, 2, n-2), where n is the sample size with (1- being the coverage probability. For example, a 95% coverage region gives = .05.

Page 44: Module Six: Outlier Detection for Two Sample Case

44

For the Tensile Strength inter-laboratory study, the number of participated labs = n = 16.

For, a 95% coverage region in the Two-sample plot, the critical value is F(.05, 2,14) = 3.74 from an F-table. The two degrees of freedom are ‘2’ for numerator d.f., and n-2 = 14 for the denominator d.f..

4. Use Minitab to determine F(.05,2,14):

• Go to Calc, choose Probability Distributions, then select ‘F’.

• In the F-distribution Dialog box, click on ‘Inverse cumulative probability.

• Enter d.f. 2 and 14, respectively.

• Click on Input Constant and enter .95.

5. Make the Session window an active window. Then, go to Editor Menu, and click on ‘Enable Commands’.

Page 45: Module Six: Outlier Detection for Two Sample Case

45

6. In the Session window, you should see ‘MTB>’ appears. We will use Minitab commands to input the F(.05, 2,14) and run the BCC Macro.

7. In the Session window, enter the following two commands next to ‘MTB>’ :

MTB> let k11 = 3.74

MTB> %BCC

The ‘let’ statement defines a minitab constant: k11 = F(.05,2,14)

The ‘%’ before the macro name ‘BCC’ is to identify ‘BCC’ as a minitab macro.

The results from the %BCC macro are a Two-sample plot with both scatter plot and the 95% ellipse region, and some principal component computations stored in the worksheet.

Page 46: Module Six: Outlier Detection for Two Sample Case

46

How to use Minitab to construct Marginal Box plot?

Marginal box plot is available in the Graph Menu.

1. Go to Graph, choose Marginal Plot.

2. In the Dialog box, enter Y-variable, and X-variable.

3. You can choose different types of marginal plots, including Histogram, Dot-lot and Box plot. Choose the one you would like to construct.

4. Click on the ‘Symbol’ selection, you can define the type of symbols for the scatter plot.

Page 47: Module Six: Outlier Detection for Two Sample Case

47

How to use Minitab to conduct a Principal Component Analysis?

In my discussion of constructing principal components, and angle, the relationship between (X1, X2) and the principal components, I have used Minitab to obtain these results for the Tensile Strength example. Here is how Minitab does the analysis.

1. Go to Stat, choose Multivariate, then select Principal Components.

2. In the Dialog box, enter variables X1 and X2. Choose the Type of Matrix to be Covariance.

3. Click on the ‘Graph’ , then select ‘Score plot for first two components’.

4. Click on the ‘Storage’, you can choose to store the coefficients for computing (y1,y2) using (x1,x2)(that is cosand sin),and store the values of (Y1, Y2). For example, store Coefficients in C5 and Store (Y1,Y2) inC6, C7.

Page 48: Module Six: Outlier Detection for Two Sample Case

48

How to Interpret Two-sample plots?

In a typical inter-laboratory testing study, the pairs of samples usually are of different, but similar materials are sent out periodically to a number of participating laboratories. All labs run a predetermined number of replicate measurements on each sample for a number of properties.

•For each property, the pair of two samples can be plotted as a Youden Plot or more generally, a bivariate control chart.

•Generally speaking, the points on the plot fall within an ellipse region, which may be defined as a 95% coverage region or 99%.

•The major axis (the major principal components) of the ellipse is approximately 45o.

•The length of the principal component ( based on previous discussion) is related to the ‘between laboratory variability’, and the length of the second principal component ( ) to the ‘material-laboratory interaction’.

*12 c

*22 c

Page 49: Module Six: Outlier Detection for Two Sample Case

49

A General Statistical Model for interpreting the possible source of variability

NOTE:

• If the between-lab variability is not homogeneous for two samples, the major principal component of the ellipse may not be 45o.

• In addition, the pattern and interpretation of the plot of the plot heavily depends on the model for the experiment and the source of variability in both samples and their inter-relationships.

• A general statistical model that includes important sources of variability is necessary in order to provide adequate interpretation of Youden’s two-sample plot under different assumptions of the sources of variability.

• Youden emphasizes that the two samples should be similar and reasonably close in the measurement of the property evaluated.However, this may not hold in inter-laboratory studies. A more general situation is to consider the following possibilities.

Page 50: Module Six: Outlier Detection for Two Sample Case

50

1. Material A and B have different population averages, and

2. Each lab has a systematic effect, denoted by LiA, and LiB, respectively.

3. The property is measured with an unexplainable random error from each laboratory, denoted by iA and iB, respectively.

The observed measurement from each lab , denoted by XiA and XiB can be expressed by the statistical model:

XiA = + LiA + iA , i = 1,2, 3, ----, n

XiB = + LiB + iB , I = 1,2,3, ---, n

NOTE that the systematic lab effects is assumed dependent only on the population of the material, not on the material itself. In other words, for different material with the same population average, the systematic lab effect remain the same, even though the material may be different.

It is important to realize that the interpretation of Youden plot depends critically on the specific assumptions on LiA, LiB, iA and iB.

When data are observed, and are estimated by the corresponding sample averages: , and the model is center to (0,0):

and A Bx x

Page 51: Module Six: Outlier Detection for Two Sample Case

51

xiA = XiA - = LiA + iA

xiB = XiB - = LiB + iB

AxBx

Expressing the variance-covariance of (x1, x2) in terms of the variance components of the systematic effects and random error, we obtain, under the assumption that systematic effects are independent of the random error, and the random errors are independent from lab to lab:

21 1 2

21 2 2

s rs s

rs s s

ˆ( )V X 2 2

2 2A A A B

A B B B

L R L L

L L L R

s s rs s

rs s s s

=

We can obtain the variance-covariance estimates from the data, and then, to estimate the variance component of each source of variation.

However, some assumptions are needed in order for us to do so, since we have three equations with five unknowns.

Page 52: Module Six: Outlier Detection for Two Sample Case

52

When we impose any assumptions, it is important to keep in mind that these assumptions should reflect the properties of populations underline the experiments. Considering an inter-laboratory experiment using two materials, A and B which can be described in terms of the model

XiA = + LiA + iA , i = 1,2, 3, ----, n

XiB = + LiB + iB , I = 1,2,3, ---, n

• It is often that the averages of two materials are not the same, that is

• The random error from each sample may have different variability (not homogeneous), that is, variance of random error for material A may not be the same as that for material B.

• There are situations where we can demonstrate the equal variance for the random errors of two material, but the systematic biases between two material may not be the same for different labs.

Page 53: Module Six: Outlier Detection for Two Sample Case

53

Based on the above scenarios, the following three types of model can be considered, and the variance components can be estimated for each type of model.

• One may consider the situation by assuming the systematic effects, the laboratory effect, are the same because of the standardized procedure is implemented, that is, LiA = LiB = L (additive model)

• A less restrictive assumption would be: The laboratory biases are in proportional relationships as so that LiA= K(LiB) (concurrent model).

• Foe each of above models, if we can demonstrate homogeneity of variances of random errors of two materials, we can assume: V(V(for all labs, (homogeneous variance).

Page 54: Module Six: Outlier Detection for Two Sample Case

54

Under the additive model: we obtain the following variance component estimate for each source of variation:

2122

221

21

221

2

221

2222

2221

, ,

: are components variance theTherefore,

, ,

srssssrssssrss

ssrsssssss

BA

BA

RRL

LRLRL

Under the concurrent model, the proportion is k = LA/LB. The variance component estimates are:

2122

2122

212

221

2222

22221

21 , ) ,

: are components variance theTherefore,

, ,

ssk

rssskrsssss

k

rs

kssrssssssks

BAB

BBBAB

RRL

LRLRL

Page 55: Module Six: Outlier Detection for Two Sample Case

55

Under the concurrent model, the proportion ‘k’ is estimated by

The average ratio of material A and material B.

.

)( , )( ,

: are components variance theTherefore,

, ,

1222

2112

212

221

2222

22221

sk

rssskrssssss

k

rs

kssrssssssks

BAB

BBBAB

RRL

LRLRL

n

xxk iBiA )/(

The variance component estimates are:

Page 56: Module Six: Outlier Detection for Two Sample Case

56

Under the homogeneous variance for additive model, the random error variance are assumed equal, that is:

And the variance component for each source is estimated by:

222RRR sss

BA

samples. twoof covariance theminus variancesample combined theiswhich

, 2/)( ,

: are components varianceThe

estimates. precised morefor components variancethe

estimate todata sample twocombinecan weTherefore,

, ,

2122

21

221

2

221

2222

2221

srsssssrss

ssrsssssss

RL

LRLRL

Page 57: Module Six: Outlier Detection for Two Sample Case

57

21

222

212

212

221

2222

22221

2

1

2 ,

: are components variance theTherefore,

, ,

srsk

ksssss

k

rs

kssrssssssks

RL

LRLRL

B

BBB

Under the homogeneous variance for concurrent model, the variance components are:

Page 58: Module Six: Outlier Detection for Two Sample Case

58

Case One: Two samples are similar, and the property is measured under the same standard procedure.

This is the most restricted situation. Under this situation, we are assuming:

1. The averages are approximately the same, that is

2. The lab systematic effects are equal for two samples, that is,

LiA = LiB = Li

3. The random error components follow the same normal distribution. and independent.

A Bx x

2 2 2 21

2 2 2 21

2 2 2 2 2 2 2 2

( 1)( ) / 2

( 1)( ) / 2

[( ) ( )] 4

A B A B

A B A B

A B A B A B

L L R R

L L R R

L L R R L L

n s s s s M

n s s s s M

M s s s s r s s

2 2 2 2tan [( ) ( ) ] /(2 )A B A B A BL L R R L Ls s s s M rs s

The principal components of the ellipse region for the general model is given by the following:

Page 59: Module Six: Outlier Detection for Two Sample Case

59

Under this case, the variance components are simplified to:

the principal components are reduced to

2 2 2 2 2 2 and A B A BL L L R R Rs s s s s s

2 21

22

( 1)(2 )

( 1)( )

and tan 1, that is, the angle 45

L R

R

o

n s s

n s

Under this situation, the Youden plot can be interpreted by:

1. The minor principal component, 2 , measures the random error.

2. The major principal component, 1, measures the systematic effect, if the random error is relatively small.

3. The ratio 1/ 2 measures the impact of the systematic effect with respect to the random error.

Page 60: Module Six: Outlier Detection for Two Sample Case

60

Case Two: Under this situation, we are assuming:

1. The averages are approximately, that is

2. The lab systematic effects are equal for two samples, that is,

LiA = LiB = Li

3. The random error components are independent, but follow two different normal distributions with different , but, small variances relative to .

Under these conditions, we have :

A Bx x

2Ls

.45ely approximat iscomponent principalemajor th of angle theis that ,1tan

2/))(1(

2/))(1()1(2

o

222

2221

BA

BA

RR

RRL

ssn

ssnsn

The same interpretation of Case One holds, here.

Page 61: Module Six: Outlier Detection for Two Sample Case

61

Case Three: Under this situation, we are assuming:

1. The averages are approximatelyequal, that is

2. The lab systematic effects are equal for two samples, that is,

LiA = LiB = Li

3. The random error components are independent, but follow two different normal distributions with different variances:

Under these conditions, we have:

A Bx x

22222

222

2222

2221

)()2( where,

2/])[(tan

2/])(2)[1(

2/])(2)[1(

BA

BA

BA

BA

RRL

LRR

RRL

RRL

sssM

sMss

Msssn

Msssn

22 and BA RR ss

Page 62: Module Six: Outlier Detection for Two Sample Case

62

Case Four: When the model is a concurrent model

The systematic bias effects between material A and B is a retio relationship: LiA= k(LiB). The observed two-sample data are (XiA, XiB).

• An approach to determine if the model is concurrent or not

1. Compute the ratio of the variable with larger measurements to the smaller variable, so that the ratio is larger than one. Suppose XiA > XiB . Compute ki = XiA/ XiB for each lab.

2. Compute the average of ki :

3. If there is a consistent pattern of a similar ratios , then, the model is considered as concurrent model. (Note: there is no specific rule available.)

n

kk i

Page 63: Module Six: Outlier Detection for Two Sample Case

63

Interpretation of Youden’s Plot for case Three

For this case, the two variances of random errors of two materials are different, the angle of the ellipse depending on the magnitude of the variances.

1. If

Then, tan will be larger than one, and the angle is larger than 45o. When , then, the major principal component will approach 90o, and it will be almost vertical.

2. On the other hand, if

Then, Then, tan will be smaller than one, and the angle is smaller than 45o. When , then, the major principal component will approach 0o, and it will be almost horizontal.

22L

2 and s torelative small very is BA RR ss

22 n larger thamuch relatively is LR ssB

22L

2 and s torelative small very is AB RR ss

22 n larger thamuch relatively is LR ssA

Page 64: Module Six: Outlier Detection for Two Sample Case

64

Transforming concurrent model to additive model

And analyzing the concurrent model using an additive model based on the transformed data.

The complexity of concurrent model suggests for a transformation from concurrent model to additive model is preferred.

• Suppose XiA > XiB , and the average of ki is computed:

• Compute a new variable: X’iA = XiA/

• Analyze the two-sample data using the additive model approach based on (X’iA, XiB).

The analysis and the interpretation of Youden plots can be extended to this transformed two-sample data.

k

k

Page 65: Module Six: Outlier Detection for Two Sample Case

65

Examples for the interpretation of Youden’s Plot

Consider the Tensile Strength Data. There are three rubber material, E,G and H. The data are: Row Laboratory E G H

1 1 94 45 80

2 2 103 43 82

3 3 94 42 77

4 4 99 50 83

5 5 97 47 86

6 6 91 66 76

7 7 91 96 81

8 8 102 128 98

9 9 98 73 83

10 10 91 54 81

11 11 93 81 82

12 12 82 45 69

13 13 93 127 81

14 14 82 150 72

15 15 83 70 73

16 16 92 89 73

Page 66: Module Six: Outlier Detection for Two Sample Case

66

We conducted the (H,E) Youden plot for (H,E) before. In the following we will analyze the Youden plot using the variance components of an additive model.

1. A quick check of the data, the ratio between each pair of material is a little over one. Therefore, we will consider the additive model for this data.

2. Using Minitab, we obtain:

úûùêëéúûùêëé

9625.40

0292.354292.46

42 component principalmajor theof angle the,56.8 ,832.78

8.92 ,8.79

22

2121

o21

21

s

srss

XX

3. The variance components are:

44.2s 3.38, s ,92.5s :are deviations Standard

93.50292.359625.40

40.110292.354292.46s ,0292.35

A

A

RL

2

2R

2

B

B

R

R

L

s

s

Page 67: Module Six: Outlier Detection for Two Sample Case

67

Based on these information, the following conclusions are obtained:

1. The systematic effect dominates random errors. This indicates there are large laboratory differences.

2. The random errors from both material are small and about the same.

3. To see if there is any outlier labs, a 95% ellipse region is constructed, and showed one serious deficient lab. The two-sam ple plot with the 95% coverage probability for the

tensile strength data is given by:

Raw Data

95% Ellipse

60 70 80 90 100

80

90

100

110

Material H

Ma

teri

al E

Two-Sample PLot for the Tensile Strength Data- Materal H Vs Material E

We notice that one lab is outside of the 95% coverage region. This is Lab 8 with two-sample results (98,102)

4. The linear pattern indicates the systematic error is large. Since for a given, when comparing with others, a linear pattern suggests the data from both material are either both high or both low, which shows the existence of large systematic errors between labs.

Page 68: Module Six: Outlier Detection for Two Sample Case

68

Youden plot for Material H and Material G

Results of descriptive summary and Principal Component analysis are given ine following

Covariances: H, G

H G

H 46.4292

G 33.8750 1178.7833

Variable N Mean Median TrMean StDev SE Mean

H 16 79.81 81.00 79.29 6.81 1.70

G 16 75.38 68.00 72.43 34.33 8.58

Eigenvalue 1179.8 45.4

Proportion 0.963 0.037

Variable PC1 PC2

H 0.030 1.000

G 1.000 -0.030

Cos = .03, hence the angle of the major principal component = 88o

The ratios of H to G are varied a lot. Therefore, concurrent model is not appropriate. We apply additive model for this data.

Page 69: Module Six: Outlier Detection for Two Sample Case

69

The variance components based on an additive model are:

8.33s 3.5, s ,8.5s :are deviations Standard

90.1144875.3378.1178

55.12875.334292.46s ,875.33

A

A

RL

2

2R

2

B

B

R

R

L

s

s

0 100 200

0

100

200

H

G

Scatter plot of material H and G with proper scale

Scatter plot indicates a huge variation among G material

Page 70: Module Six: Outlier Detection for Two Sample Case

70

Data95%Ellipse

60 70 80 90 100

0

100

200

Material H

Ma

teri

al G

Two Sample Youden Plot for Material H (X) Vs Material G (Y)

Youden plot indicates an potential outlier labs, Lab#8 with data (98,128) and Lab#14 with data (72,150).

65 70 75 80 85 90 95 100

30

50

70

90

110

130

150

HG

Marginal Plot for Material H and G

There is a clear outlier lab for material H. Material G is highly skewed and

large uncertainty among labs.

Page 71: Module Six: Outlier Detection for Two Sample Case

71

Conclusions:

1. Scatter plot indicates a huge variation among G testing material.

2. The major principal component is at 88o, almost vertical. This suggests labs testing material G are very inconsistent.

3. Further variance decomposition shows the the random error variance extremely dominates the uncertainty component among labs for material G.

4. The systematic bias effect is small. The random error for material H is also small.

5. There is an outlier lab: Lab #8 according to the 95% ellipse region.

6. Marginal box-plot indicates a clear outlier lab in material H. For material G, the variation is huge and stretched from 43 to 150. At least two labs are much larger than others. This result also indicates that the material G is much more sensitive to different environment than material H. Some extremely high strength deserve further investigation. In particular, one should try to locate special causes for such a high strength, and try to reproduce the results, if possible.

Page 72: Module Six: Outlier Detection for Two Sample Case

72

The issue about computation

In this module, we have used a variety of procedures in Minitab, I order to accomplish our analysis. The procedures we applied include:

•Graphs:

Plot, Marginal Plot, Box Plot, Time Series Plot, Histogram, Probability Plot.

•Statistical Procedures:

Descriptive statistics, Covariance, Correlation, Normality test, Regression, Individual control chart, Principal Component Analysis, and Bivariate control chart.

The combination of applying the right tool at the right time is the most important, yet, most difficult. It will take a good understanding of the statistical concepts as well as proficient usage of Minitab.

Page 73: Module Six: Outlier Detection for Two Sample Case

73

Hands-On Activity#1:

Repeat the same analysis of the Tensile Data.

Hands-On Activity #2:

Conduct a thorough investigation of of the TAPPI Inter-laboratory study.

Hands-On activity #3:

Bring your own two-sample data and conduct a thorough investigation of your data.