CHEE320 - Fall 2001 J. McLellan CHEE320 Module 1: Graphical Methods for Analyzing Data, and Descriptive Statistics

Upload: adriel

Post on 05-Jan-2016


Page 1: CHEE320

CHEE320 - Fall 2001

J. McLellan

CHEE320

Module 1: Graphical Methods for Analyzing Data, and Descriptive Statistics


Graphical Methods for Analyzing Data

What is the pattern of variability?

Techniques

• histograms
• dot plots
• stem and leaf plots
• box plots
• quantile plots


Histogram

• summary of frequency with which certain ranges of values occur

• ranges - “bins”

• choosing bin size - influences ability to recognize pattern

» too large - data clustered in a few bins - no indication of spread of data

» too small - data distributed with a few points in each bin - no indication of concentration of data

» there are quantitative rules for choosing the number of bins - typically automated in statistical software

• not automated in Excel!
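
The slide does not name the quantitative rules it has in mind. As an illustration only, the sketch below uses Sturges' rule, one commonly quoted choice, to pick the number of bins and then counts the observations per bin. The function names are made up for this example, and the data are the tooth discoloration values that appear in the stem and leaf example of these notes.

```python
import math

def sturges_bin_count(n):
    """Sturges' rule: k = ceil(log2(n) + 1) bins for n observations."""
    return math.ceil(math.log2(n) + 1)

def equal_width_counts(data, k):
    """Count observations in k equal-width bins spanning min(data)..max(data)."""
    lo, hi = min(data), max(data)
    counts = [0] * k
    for x in data:
        # Scale x to a bin index; the maximum value goes in the last bin.
        index = min(int((x - lo) * k / (hi - lo)), k - 1)
        counts[index] += 1
    return counts

# Tooth discoloration values from the stem and leaf example in these notes.
discolor = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
            11, 15, 22, 21, 27, 25, 18, 21, 18, 20]

k = sturges_bin_count(len(discolor))
counts = equal_width_counts(discolor, k)
width = (max(discolor) - min(discolor)) / k
for i, count in enumerate(counts):
    left = min(discolor) + i * width
    print(f"[{left:5.1f}, {left + width:5.1f})  {'*' * count}")
```

Passing a much smaller or much larger `k` to `equal_width_counts` reproduces the two failure modes listed above: everything piles into a few bins, or almost every bin holds zero or one point.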


[Figure: Histogram (lco90.STA 1v*768c). Horizontal axis: LCO90, binned as <= 630, (630,640], (640,650], (650,660], (660,670], (670,680], (680,690], (690,700], (700,710], (710,720], (720,730], > 730. Vertical axis: No of obs, 0 to 300.]

Histogram - Important Features

• number of peaks
• centre of gravity
• spread in the data
• tails? - extreme data points
• max, min data values - range of values
• symmetry?


Dot Plots

• similar to histogram
» plot data by value on horizontal axis
» stack repeated values vertically
» look for similar shape features as for histogram
» e.g., data set for solder thickness
» {0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1}

[Figure: dot plot of the solder thickness data; horizontal axis from 0.06 to 0.13]
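
A rough text version of a dot plot for the solder thickness data can be sketched as follows. The function name and the 0.01 grid spacing are assumptions made for this example. The plot is rotated relative to the slide: values run down the page and repeated values stack to the right.

```python
from collections import Counter

def dot_plot_lines(data, step=0.01):
    """Return one text row per grid value, with one 'o' per observation."""
    # Work in integer multiples of `step` to avoid floating-point key mismatches.
    units = Counter(round(x / step) for x in data)
    rows = []
    for u in range(min(units), max(units) + 1):
        rows.append(f"{u * step:.2f} | {'o' * units.get(u, 0)}".rstrip())
    return rows

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
lines = dot_plot_lines(solder)
print("\n".join(lines))
```

The stack of three points at 0.10 and the empty row at 0.08 are the same shape features one would read from a histogram of these data.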


Stem and Leaf Plots

• illustrate variability pattern using the numerical data itself
• choose base division - “stem”
• build “leaves” by taking digit next to base division

Data - Tooth Discoloration by Fluoride

12.00 10.00 14.00 20.00 18.00 18.00 25.00 21.00 36.00 44.00
11.00 15.00 22.00 21.00 27.00 25.00 18.00 21.00 18.00 20.00

Decimal point is 1 place to the right of the colon

Stems : Leaves
1 : 0124     (10-14)
1 : 58888    (15-19)
2 : 001112   (20-24)
2 : 557      (25-29)
3 :
3 : 6
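
The construction of the display can be sketched in a few lines. This is a minimal illustration that assumes non-negative whole-number data, takes the tens digit as the stem and the units digit as the leaf, and splits each stem into two rows (leaves 0-4 and 5-9) as in the tooth discoloration display; the function name is made up for the example.

```python
def stem_and_leaf(data, split=True):
    """Return stem-and-leaf rows for non-negative integer data.

    The stem is the tens digit and the leaf is the units digit. With
    split=True each stem gets two rows: leaves 0-4, then leaves 5-9.
    """
    values = sorted(int(v) for v in data)
    halves = [(0, 4), (5, 9)] if split else [(0, 9)]
    rows = []
    for stem in range(values[0] // 10, values[-1] // 10 + 1):
        for lo, hi in halves:
            leaves = "".join(str(v % 10) for v in values
                             if v // 10 == stem and lo <= v % 10 <= hi)
            rows.append(f"{stem} : {leaves}".rstrip())
    return rows

discolor = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
            11, 15, 22, 21, 27, 25, 18, 21, 18, 20]
rows = stem_and_leaf(discolor)
print("\n".join(rows))
```

The sketch prints every stem up to the largest observation, so it also produces rows for stem 4, whereas the display on the slide stops at stem 3.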


Stem and Leaf Plots

Solder example
» numbers viewed as 0.070, 0.080, 0.090, 0.100, 0.110,…
» decision - what is the stem?
• considerations similar to histogram - size of bins

Decimal point is 2 places to the left of the colon

7 : 0
8 :
9 : 00
10 : 000
11 : 0
12 : 0
13 : 0


Box Plots

• graphical representation of “quartile” information and extreme data values
» quartiles - describe how data occurs - ordering
» 1st quartile - separates bottom 25% of data
» 2nd quartile (median) - separates bottom 50% of data
» 3rd quartile - separates bottom 75% of data
» interquartile range (IQR) = Q3 - Q1
» add “whiskers” - extend from box to
• largest data point within upper quartile + 1.5 * interquartile range
• smallest data point within lower quartile - 1.5 * interquartile range
» plot outliers - data points outside Q3 + 1.5*IQR, Q1 - 1.5*IQR
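
The numbers behind a box plot can be computed directly. The sketch below is an illustration, not the method used by the course software: it assumes that `statistics.quantiles` with `method="exclusive"` is an acceptable quartile convention (it interpolates at positions based on N+1), and the function name is made up for the example. The data are the solder thickness values from the dot plot example.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, 1.5*IQR fences, whisker ends and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = sorted(v for v in data if lower_fence <= v <= upper_fence)
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme points still inside the fences.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": sorted(v for v in data if v < lower_fence or v > upper_fence),
    }

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
summary = box_plot_summary(solder)
for name, value in summary.items():
    print(name, value)
```

For the solder data no point falls outside the fences, so the whiskers reach the minimum and maximum observations and the outlier list is empty.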


Box Plot - for solder data

[Figure: Box Plot (jsolder.STA 10v*10c). Vertical axis: THICKNES, 0.065 to 0.135. Legend: Non-Outlier Max, Non-Outlier Min; 75%, 25%; Median.]

Interpretation
• no outliers
• relatively symmetric distribution
• longer tails on both sides
• fairly tightly clustered about centre


Box Plot - for teeth discoloration

[Figure: Box Plot (teethdisc.STA 10v*20c), VAR2: 1. Vertical axis: DISCOLOR, 8 to 32. Legend: Non-Outlier Max, Non-Outlier Min; 75%, 25%; Median.]

Interpretation
• no outliers
• asymmetric distribution - long lower tail
• some tails on both sides
• fairly tightly clustered at higher range of discoloration


Quantile Plots

• plot cumulative progression of data
» values vs. cumulative fraction of data
» comparison to standard distribution shapes
• e.g., normal distribution, lognormal distribution, …
» can be plotted on special axes to provide visual test for closeness to given distribution
• analogous to semi-log graphs
• e.g., test to see if data are normally distributed
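
The coordinates of a normal quantile plot can be sketched as follows. The plotting position (i - 0.5)/n for the i-th smallest value is an assumption made here; statistical packages use several slightly different conventions, and the function name is made up for the example. The data are the tooth discoloration values from the stem and leaf example.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Pair each ordered observation with a standard normal quantile.

    Assumed plotting position: (i - 0.5) / n for the i-th smallest value.
    """
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), value)
            for i, value in enumerate(ordered, start=1)]

discolor = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
            11, 15, 22, 21, 27, 25, 18, 21, 18, 20]
points = normal_quantile_points(discolor)
for theoretical, observed in points:
    print(f"{theoretical:6.3f}  {observed}")
```

Plotting observed value against theoretical quantile gives the quantile plot; points from normally distributed data should fall close to a straight line.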


Quantile Plot - teeth discoloration

[Figure: Quantile-Quantile Plot of DISCOLOR (teethdisc.STA 10v*20c). Distribution: Normal. Fitted line: y=20.798+7.916*x+eps. Horizontal axis: Theoretical Quantile, -2.5 to 2.5, with cumulative probabilities .01 .05 .1 .25 .5 .75 .9 .95 .99 marked. Vertical axis: Observed Value, 5 to 50.]

Interpretation
• data don’t follow linear progression
- underlying distribution not normal?

Note the irregular spacing of the probability scale - similar to “semi-log” paper - cumulative points should follow linear progression on this scale if distribution is normal.


Graphical Methods for Quality Investigations

• primary purpose - help organize information in quality investigation

Examples
• Pareto Charts
• Fishbone diagrams - Ishikawa diagrams


Pareto Chart

• used to rank factors
• typically present as a bar chart, listing in descending order of significance
• significance can be determined by
» number count - e.g., of defects attributed to specific causes
» by size of effect - e.g., based on coefficients in regression model


Example - Circuit Defects

Number of Defects Attributed to:

Stamping_Oper_ID        1
Stamping_Missing        1
Sold._Short             1
Wire_Incorrect          1
Raw_Cd_Damaged          1
Comp._Extra_Part        2
Comp._Missing           2
Comp._Damaged           2
TST_Mark_White_Mark     3
Tst._Mark_EC_Mark       3
Raw_CD_Shroud_Re.       3
Sold._Splatter          5
Comp._Improper_1        6
Sold._Opens             7
Sold._Cold_Joint       20
Sold._Insufficient     40

Data from Montgomery
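
The ranking behind a Pareto chart can be sketched as follows: sort the causes by count in descending order and accumulate the percentage of the total. The function name is made up for this example; the counts are the circuit defect counts listed above.

```python
defects = {
    "Stamping_Oper_ID": 1, "Stamping_Missing": 1, "Sold._Short": 1,
    "Wire_Incorrect": 1, "Raw_Cd_Damaged": 1, "Comp._Extra_Part": 2,
    "Comp._Missing": 2, "Comp._Damaged": 2, "TST_Mark_White_Mark": 3,
    "Tst._Mark_EC_Mark": 3, "Raw_CD_Shroud_Re.": 3, "Sold._Splatter": 5,
    "Comp._Improper_1": 6, "Sold._Opens": 7, "Sold._Cold_Joint": 20,
    "Sold._Insufficient": 40,
}

def pareto_table(counts):
    """Rank causes by count (descending) with cumulative percentage of total."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for cause, count in ranked:
        running += count
        rows.append((cause, count, 100.0 * running / total))
    return rows

rows = pareto_table(defects)
for cause, count, cumulative in rows:
    print(f"{cause:<22}{count:>4}{cumulative:>8.1f}%")
```

The two largest categories, insufficient solder and cold joints, account for 60 of the 98 defects, which is the kind of concentration a Pareto chart is meant to expose.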


Pareto Chart

• for circuit defect data

[Figure: Pareto Chart & Analysis; NO_DEFCT. Bars in descending order: Sold._Insufficient 40, Sold._Cold_Joint 20, Sold._Opens 7, Comp._Improper_1 6, Sold._Splatter 5, TST_Mark_White_Mark 3, Tst._Mark_EC_Mark 3, Raw_CD_Shroud_Re. 3, Comp._Damaged 2, Comp._Extra_Part 2, Comp._Missing 2, Sold._Short 1, Wire_Incorrect 1, Raw_Cd_Damaged 1, Stamping_Oper_ID 1, Stamping_Missing 1. Axes: count, 0 to 100, and cumulative percentage, 0% to 100%.]


Fishbone Diagrams

• organize causes in analysis
» have spine, with cause types branching from spine, and sub-groups branching further

Example - factors influencing poor conversion in reactive extrusion

[Figure: fishbone diagram with “poor conversion” at the head of the spine. Labels on the branches: initiator type; half-life; barrel temperature; temperature distribution along barrel; temperature control; polymer grade; catalyst used - metallocene/Ziegler-Natta.]


Graphical Methods for Analyzing Data

Looking for time trends in data...

• Time sequence plot - look for
» jumps - indicate shift in mean operation
» ramps to new values - indicate shift in mean operation
» meandering - indicates time correlation in data
» large amount of variation about general trend - indication of large variance


Time Sequence Plot - Naphtha 90% Point

[Figure: Time Sequence Plot. Vertical axis: 90% point (degrees F), 390 to 480. Horizontal axis: 0 to 270.]

• for naphtha 90% point - indicates amount of heavy hydrocarbons present in gasoline range material

Annotations on the plot
• excursion - sudden shift in operation
• meandering about average operating point - time correlation in data


Graphical Methods for Analyzing Data

Monitoring process operation
• Quality Control Charts
- time sequence plots with added indications of variation
» account for fluctuations in values associated with natural process noise
» look for significant jumps - shifts - that exceed normal range of variation of values
» if significant shift occurs, stop and look for “assignable causes”
» essentially graphical “hypothesis tests”
» can plot - measurements, sample averages, ranges, standard deviations, ...


Example - Monitoring Process Mean

• is the average process operation constant?
• collect samples at time intervals, compute average, and plot in time sequence plot
• indication of process variation - standard deviation estimated from prior data
» propagates through sample average calculation
» if “s” is sample standard deviation and “n” is the number of observations in each sample, calculated averages will lie within $\pm 3 s/\sqrt{n}$ of the historical average about 99.7% of the time if the mean operation has NOT shifted
» values outside this range suggest that a shift in the mean operation has occurred - alarm - “something has happened”
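
The alarm limits can be computed directly from the 3 s/√n rule. The sketch below uses the centre line, process sigma and sample size quoted on the X-bar chart that follows in these notes (74.0012, 0.009785 and n = 5); the function names and the three sample averages are hypothetical values chosen only to exercise the check.

```python
import math

def xbar_limits(centre, sigma, n, width=3.0):
    """Lower and upper limits: centre -/+ width * sigma / sqrt(n)."""
    half_width = width * sigma / math.sqrt(n)
    return centre - half_width, centre + half_width

def points_outside(sample_means, lower, upper):
    """Return (sample number, mean) for averages outside the limits."""
    return [(i, m) for i, m in enumerate(sample_means, start=1)
            if m < lower or m > upper]

lower, upper = xbar_limits(centre=74.0012, sigma=0.009785, n=5)
print(f"lower limit {lower:.4f}, upper limit {upper:.4f}")

# Hypothetical sample averages, not taken from the course data.
hypothetical_means = [74.003, 73.995, 74.020]
alarms = points_outside(hypothetical_means, lower, upper)
print(alarms)
```

The computed limits agree, to within rounding, with the 73.9880 and 74.0143 lines drawn on the chart.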


Example - Monitoring Process Mean

• time sequence plot with these alarm limits is referred to as a “Shewhart X-bar Chart”

» X-bar - sample mean of X

[Figure: X-BAR chart. Mean: 74.0012 (74.0012), Proc. sigma: .009785 (.009785), n: 5. Horizontal axis: Samples, 1 to 25. Lower control limit 73.9880, centre line 74.0012, upper control limit 74.0143.]

Annotations on the chart
• upper and lower control limits
• centre-line or target line - indicates mean when process is operating properly
• no points exceed limits in a state of statistical control


Example - Monitoring Process Mean

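• X-bar chart

[Figure: X-BAR chart. Mean: 74.0022 (74.0022), Proc. sigma: .011832 (.011832), n: 5. Horizontal axis: Samples, 1 to 25. Lower control limit 73.9864, centre line 74.0022, upper control limit 74.0181.]

Annotation on the chart
• Point exceeds region of natural variation - significant shift has occurred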


Graphical Methods for Analyzing Data

Visualizing relationships between variables

Techniques

• scatterplots
• scatterplot matrices
» also referred to as “casement plots”


Scatterplots

… are also referred to as “x-y diagrams”
• plot values of one variable against another
• look for systematic trend in data
» nature of trend
• linear?
• exponential?
• quadratic?
» degree of scatter - does spread increase/decrease over range?
• indication that variance isn’t constant over range of data


Scatterplots - Example

• tooth discoloration data - discoloration vs. fluoride

[Figure: Scatterplot (teeth 4v*20c). Horizontal axis: FLUORIDE, 0.0 to 4.5. Vertical axis: DISCOLOR, 5 to 50.]

Annotation on the plot
• trend - possibly nonlinear?


Scatterplot - Example

• tooth discoloration data - discoloration vs. brushing

[Figure: Scatterplot (teeth 4v*20c). Horizontal axis: BRUSHING, 4 to 13. Vertical axis: DISCOLOR, 5 to 50.]

Annotation on the plot
• significant trend? - doesn’t appear to be present


Scatterplot - Example

• tooth discoloration data - discoloration vs. brushing

[Figure: Scatterplot (teeth 4v*20c). Horizontal axis: BRUSHING, 4 to 13. Vertical axis: DISCOLOR, 5 to 50.]

Annotation on the plot
• Variance appears to decrease as # of brushings increases


Scatterplot matrices

… are a table of scatterplots for a set of variables

Look for -
» systematic trend between “independent” variables and dependent variables - to be described by estimated model
» systematic trend between supposedly independent variables - indicates that these quantities are correlated
• correlation can negatively influence model estimation results
• not independent information

• scatterplot matrices can be generated automatically with statistical software, manually using Excel


Scatterplot Matrices - tooth data

[Figure: Matrix Plot (teeth 4v*20c). Variables: FLUORIDE, AGE, BRUSHING, DISCOLOR.]


Describing Data Quantitatively

Approach - describe the pattern of variability using a few parameters

» efficient means of summarizing

Techniques
• average - (sample “mean”)
• sample standard deviation and variance
• median
• quartiles
• interquartile range
• ...


Sample Mean - “Average”

Given “n” observations $x_i$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Notes -
» sensitive to extreme data values - outliers - value can be artificially raised or lowered


Sample Variance

• sum of squared deviations about the average
» squaring - notion of distance (squared)
» average - is the centre of gravity
• sample variance provides a measure of dispersion - spread - about the centre of gravity

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Note that we divide by “n-1”, and NOT “n” - degrees of freedom argument

Note - there is an alternative form of this equation which is more convenient for computation.
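
The slide does not write out the alternative form. A commonly used one, assumed here to be the form intended, expands the square to give s² = (Σxᵢ² − n·x̄²)/(n−1). The sketch below computes the sample variance of the solder thickness data both ways; the function names are made up for the example.

```python
def sample_variance(data):
    """Definition: sum of squared deviations about the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """Assumed alternative form: (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
v_definition = sample_variance(solder)
v_shortcut = sample_variance_shortcut(solder)
print(f"{v_definition:.6f}  {v_shortcut:.6f}")
```

The shortcut needs only running totals of x and x², which is convenient by hand, but it subtracts two nearly equal numbers and can lose precision in floating point when the mean is large relative to the spread.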


Sample Standard Deviation

… is simply

$$s = \sqrt{s^2}$$

• sample standard deviation provides a more direct link to dispersion
» e.g., for Normal distribution
• about 95% of values lie within 2 standard devn’s of the mean
• about 99.7% of values lie within 3 standard devn’s of the mean
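
The coverage figures for the Normal distribution can be checked numerically with the standard library. The function name is made up for this example.

```python
from statistics import NormalDist

def coverage_within(k):
    """Fraction of a Normal population within k standard deviations of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.4f}")
```

The result does not depend on the particular mean or standard deviation, because any Normal variable can be rescaled to the standard one.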


Range

• provides a measure of spread in the data
• defined as
maximum data value - minimum data value
• can be sensitive to extreme data points
• is often monitored in quality control charts to see if process variance is changing


“Order” Statistics

… summarize the progression of observations in the data set

Quartiles
» divide the data in quarters

Deciles
» divide the data in tenths ...


Quartiles

• order data - N data points {$y_i$}, i=1,…N
• if N is odd,
» median is observation $y_{(N+1)/2}$
• if N is even,
» median is $\frac{y_{N/2} + y_{N/2+1}}{2}$
• i.e., midpoint between two middle points


Quartiles - Q1 and Q3

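• Q1: Compute (N+1)/4 = A.B

$$Q1 = y_A + B \cdot (y_{A+1} - y_A)$$

• Q3: Compute 3(N+1)/4 = A.B

$$Q3 = y_A + B \cdot (y_{A+1} - y_A)$$

» i.e., interpolate between adjacent points, where A is the integer part of the computed position and B is its fractional part (e.g., a position of 2.5 gives A = 2 and a fraction of 0.5)
» Note - there are other conventions as well - e.g., for Q1, take bottom half of data set, and take midpoint between middle two points if there are an even number of points...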


Quartiles - Example

• solder data set
» observations
» 0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1
» ordered: 0.07, 0.09, 0.09, 0.1, 0.1, 0.1, 0.11, 0.12, 0.13
» 9 points --> median is 5th observation: 0.1
» Q1: (N+1)/4 = 2.5
• Q1 = 0.09 + 0.5*(0.09-0.09) = 0.09
» Q3: 3(N+1)/4 = 7.5
• Q3 = 0.11 + 0.5*(0.12-0.11) = 0.115
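
The position rule can be written as a short function and checked against the worked example. The function name is made up for this sketch, and it assumes at least three data points so that the computed position stays inside the data. For comparison, the standard library's `statistics.quantiles` is called with its two built-in conventions, which shows how the choice of convention can change the answer.

```python
import statistics

def quartile_by_position(data, k):
    """k-th quartile (k = 1, 2, 3) by interpolating at position k*(N+1)/4.

    Assumes len(data) >= 3 so that the position lies within 1..N.
    """
    y = sorted(data)
    position = k * (len(y) + 1) / 4
    a = int(position)          # integer part A (1-based index)
    fraction = position - a    # fractional part of the position
    if fraction == 0:
        return y[a - 1]
    return y[a - 1] + fraction * (y[a] - y[a - 1])

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
q1 = quartile_by_position(solder, 1)
q2 = quartile_by_position(solder, 2)
q3 = quartile_by_position(solder, 3)
print(q1, q2, q3)

exclusive = statistics.quantiles(solder, n=4, method="exclusive")
inclusive = statistics.quantiles(solder, n=4, method="inclusive")
print(exclusive)
print(inclusive)
```

For these data the "exclusive" method reproduces the hand calculation, while the "inclusive" method gives a slightly smaller third quartile.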


Robustness

… refers to whether a given descriptive statistic is sensitive to extreme data points

Examples

• sample mean
» is sensitive to extreme points - extreme value pulls average toward the extreme
• sample variance
» sensitive to extreme points - large deviation from the sample mean leads to inflated variance
• median, quartiles
» relatively insensitive to extreme data points


Robustness - Solder Data Example

• replace 0.13 by 0.5 - output from Excel

thickness             With 0.13    With 0.5
Mean                  0.101111     0.142222
Median                0.1          0.1
Mode                  0.1          0.1
Standard Deviation    0.017638     0.134887
Sample Variance       0.000311     0.018194
Range                 0.06         0.43
Minimum               0.07         0.07
Maximum               0.13         0.5
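
The comparison can be reproduced with Python's `statistics` module instead of Excel. This is a sketch for checking the figures; the helper name `summarize` is made up for the example.

```python
import statistics

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
# Replace the largest observation, 0.13, by the extreme value 0.5.
with_extreme = [0.5 if x == 0.13 else x for x in solder]

def summarize(data):
    """Descriptive statistics corresponding to the Excel output."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),        # divides by n - 1
        "variance": statistics.variance(data),  # divides by n - 1
        "range": max(data) - min(data),
    }

original = summarize(solder)
modified = summarize(with_extreme)
for name in original:
    print(f"{name:<9}{original[name]:>10.6f}{modified[name]:>10.6f}")
```

The mean, standard deviation, variance and range all move substantially when one point is changed, while the median stays at 0.1.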


Robustness

• Other robust statistics
» “m-estimator” - involves iterative filtering out of extreme data values, based on data distribution
» trimmed mean - other bases for eliminating extreme data point effect
» median absolute deviation
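
Two of these statistics are simple enough to sketch. The definitions below are common textbook ones and are assumptions as far as these notes are concerned: the trimmed mean drops a fixed fraction of points from each end of the ordered data before averaging, and the median absolute deviation is the median of the absolute deviations from the median. The function names and the 20% trimming fraction are choices made for the example.

```python
import statistics

def trimmed_mean(data, trim_fraction):
    """Mean after dropping trim_fraction of the points from each end."""
    y = sorted(data)
    k = int(len(y) * trim_fraction)  # points removed from each tail
    return statistics.mean(y[k:len(y) - k])

def median_absolute_deviation(data):
    """Median of the absolute deviations from the median."""
    centre = statistics.median(data)
    return statistics.median([abs(x - centre) for x in data])

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
with_extreme = [0.5 if x == 0.13 else x for x in solder]

tm = trimmed_mean(with_extreme, 0.2)  # drops 0.07 and 0.5
mad_original = median_absolute_deviation(solder)
mad_extreme = median_absolute_deviation(with_extreme)
print(f"trimmed mean with 0.5: {tm:.6f}")
print(f"MAD with 0.13: {mad_original:.4f}, with 0.5: {mad_extreme:.4f}")
```

With the extreme value 0.5 in the data, the trimmed mean stays close to 0.1 and the median absolute deviation is unchanged, in contrast to the ordinary mean and standard deviation in the Excel comparison.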