CHEE320 - Fall 2001
J. McLellan
CHEE320
Module 1: Graphical Methods for Analyzing Data, and Descriptive Statistics
Graphical Methods for Analyzing Data
What is the pattern of variability?
Techniques
• histograms
• dot plots
• stem and leaf plots
• box plots
• quantile plots
Histogram
• summary of frequency with which certain ranges of values occur
• ranges - “bins”
• choosing bin size - influences ability to recognize pattern
» too large - data clustered in a few bins - no indication of spread of data
» too small - data distributed with a few points in each bin - no indication of concentration of data
» there are quantitative rules for choosing the number of bins - typically automated in statistical software
• not automated in Excel!
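The effect of bin size can be tried out with a minimal sketch like the one below. The slides do not say which quantitative rule the statistical software applies, so Sturges' rule is used here purely as one commonly quoted example; the function names are hypothetical, and the data are the tooth discoloration values listed with the stem and leaf example in these notes.

```python
import math

def sturges_bins(n):
    """Sturges' rule (an assumed choice of rule): suggested number of bins."""
    return math.ceil(math.log2(n)) + 1

def histogram_counts(data, low, width, n_bins):
    """Count observations in equal-width bins [low + k*width, low + (k+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - low) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Tooth discoloration values from the stem and leaf example in these notes.
discolor = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
            11, 15, 22, 21, 27, 25, 18, 21, 18, 20]

print(sturges_bins(len(discolor)))                             # suggested bin count
print(histogram_counts(discolor, low=10, width=5, n_bins=7))   # moderate bins
print(histogram_counts(discolor, low=10, width=20, n_bins=2))  # bins too large
```

With a width of 20 nearly all points fall in a single bin, which is the "too large" case described above.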
[Figure: Histogram (lco90.STA 1v*768c); horizontal axis: LCO90, bins of width 10 from <= 630 to > 730; vertical axis: No of obs, 0 to 300]
Histogram - Important Features
number of peaks
centre of gravity
spread in the data
tails? - extreme data points
max, min data values- range of values
symmetry?
Dot Plots
• similar to histogram
» plot data by value on horizontal axis
» stack repeated values vertically
» look for similar shape features as for histogram
» e.g., data set for solder thickness
» {0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1}
[Figure: dot plot of the solder thickness data; horizontal axis from 0.06 to 0.13]
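A text-only dot plot for the solder data can be sketched as follows. This is an illustration rather than the plotting routine used for the slide: the function name is hypothetical, and the plot is rotated so that values run down the page and repeated values stack to the right.

```python
from collections import Counter

def dot_plot(data, decimals=2):
    """Return one text row per distinct value, with one mark per occurrence."""
    counts = Counter(round(x, decimals) for x in data)
    return [f"{value:.{decimals}f} | " + "*" * counts[value]
            for value in sorted(counts)]

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
for row in dot_plot(solder):
    print(row)
```

The longest row marks the most frequent value, which plays the same role as the tallest bar of a histogram.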
Stem and Leaf Plots
• illustrate variability pattern using the numerical data itself
• choose base division - “stem”
• build “leaves” by taking digit next to base division

Data - Tooth Discoloration by Fluoride
12.00, 10.00, 14.00, 20.00, 18.00, 18.00, 25.00, 21.00, 36.00, 44.00
11.00, 15.00, 22.00, 21.00, 27.00, 25.00, 18.00, 21.00, 18.00, 20.00

Decimal point is 1 place to the right of the colon
Stems : Leaves
1 : 0124      (10-14)
1 : 58888     (15-19)
2 : 001112    (20-24)
2 : 557       (25-29)
3 :
3 : 6
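The construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: whole-number data, the tens digit as the stem, and each stem split into a low half (leaves 0-4) and a high half (leaves 5-9) as in the plot above. The function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data, split=True):
    """Stem = tens digit, leaf = units digit; optionally split each stem
    into a low half (leaves 0-4) and a high half (leaves 5-9)."""
    rows = defaultdict(list)
    for x in sorted(int(v) for v in data):
        stem, leaf = divmod(x, 10)
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    lines = []
    last = max(rows)
    stem, half = min(rows)
    while (stem, half) <= last:
        lines.append(f"{stem} : " + "".join(rows.get((stem, half), [])))
        # advance to the next half-stem (or the next stem when not splitting)
        if split and half == 0:
            half = 1
        else:
            stem, half = stem + 1, 0
    return lines

discolor = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
            11, 15, 22, 21, 27, 25, 18, 21, 18, 20]
for line in stem_and_leaf(discolor):
    print(line)
```

Because the sketch keeps every observation, it also prints a row for the value 44, which does not appear in the plot above.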
Stem and Leaf Plots
Solder example
» numbers viewed as 0.070, 0.080, 0.090, 0.100, 0.110,…
» decision - what is the stem?
• considerations similar to histogram - size of bins

Decimal point is 2 places to the left of the colon
 7 : 0
 8 :
 9 : 00
10 : 000
11 : 0
12 : 0
13 : 0
Box Plots
• graphical representation of “quartile” information and extreme data values
» quartiles - describe how data occurs - ordering
» 1st quartile - separates bottom 25% of data
» 2nd quartile (median) - separates bottom 50% of data
» 3rd quartile - separates bottom 75% of data
» interquartile range (IQR) = Q3 - Q1
» add “whiskers” - extend from box to the most extreme data points within
• upper quartile + 1.5 * interquartile range
• lower quartile - 1.5 * interquartile range
» plot outliers - data points above Q3 + 1.5*IQR or below Q1 - 1.5*IQR
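The numbers behind a box plot can be computed with a short sketch such as the one below. It assumes the interpolating quartile convention based on positions (N+1)/4 and 3(N+1)/4, which Python's `statistics.quantiles` provides through `method="exclusive"`; other packages may use different conventions, and the function name is hypothetical.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, 1.5*IQR fences, whisker ends and outliers for a box plot."""
    ordered = sorted(data)
    # method="exclusive" interpolates at positions (N+1)/4, (N+1)/2, 3(N+1)/4
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
print(box_plot_summary(solder))
```

For the solder thickness data the fences enclose every point, so the whiskers run to the minimum and maximum and no outliers are flagged.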
Box Plot - for solder data
[Figure: Box Plot (jsolder.STA 10v*10c) of THICKNES; vertical axis from 0.065 to 0.135; legend: non-outlier max and min, 75% and 25%, median]
Interpretation
• no outliers
• relatively symmetric distribution
• longer tails on both sides
• fairly tightly clustered about centre
Box Plot - for teeth discoloration
[Figure: Box Plot (teethdisc.STA 10v*20c) of DISCOLOR; vertical axis from 8 to 32; legend: non-outlier max and min, 75% and 25%, median]
Interpretation
• no outliers
• asymmetric distribution - long lower tail
• some tails on both sides
• fairly tightly clustered at higher range of discoloration
Quantile Plots
• plot cumulative progression of data
» values vs. cumulative fraction of data
» comparison to standard distribution shapes
• e.g., normal distribution, lognormal distribution, …
» can be plotted on special axes (analogous to semi-log graphs) to provide visual test for closeness to given distribution
• e.g., test to see if data are normally distributed
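The points of a normal quantile plot can be generated with a sketch like the following. The plotting position (i - 0.5)/n is an assumed convention (statistical packages differ on this detail), and the function name is hypothetical. Each ordered observation is paired with the standard normal quantile of its cumulative fraction; if the data are close to normal, the pairs fall near a straight line.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Pair each ordered observation with a standard normal quantile.

    The plotting position (i - 0.5)/n is an assumed convention."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), y)
            for i, y in enumerate(ordered, start=1)]

# Tooth discoloration values from the stem and leaf example in these notes.
discolor = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
            11, 15, 22, 21, 27, 25, 18, 21, 18, 20]
for theoretical, observed in normal_quantile_points(discolor):
    print(f"{theoretical:6.2f}  {observed}")
```

Plotting the observed values against the theoretical quantiles on ordinary linear axes gives the same visual test as plotting against cumulative fraction on normal probability paper.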
Quantile Plot - teeth discoloration
[Figure: Quantile-Quantile Plot of DISCOLOR (teethdisc.STA 10v*20c); Distribution: Normal; fitted line y=20.798+7.916*x+eps; horizontal axis: Theoretical Quantile, -2.5 to 2.5, with cumulative fractions .01 to .99 marked; vertical axis: Observed Value, 5 to 50]
Interpretation
• data don’t follow linear progression
– underlying distribution not normal?
Note the irregular spacing - similar to “semi-log” paper - cumulative points should follow linear progression on this scale if distribution is normal.
Graphical Methods for Quality Investigations
• primary purpose - help organize information in quality investigation
Examples
• Pareto Charts
• Fishbone diagrams - Ishikawa diagrams
Pareto Chart
• used to rank factors
• typically present as a bar chart, listing in descending order of significance
• significance can be determined by
» number count - e.g., of defects attributed to specific causes
» by size of effect - e.g., based on coefficients in regression model
Example - Circuit Defects
Number of Defects Attributed to:
Stamping_Oper_ID        1
Stamping_Missing        1
Sold._Short             1
Wire_Incorrect          1
Raw_Cd_Damaged          1
Comp._Extra_Part        2
Comp._Missing           2
Comp._Damaged           2
TST_Mark_White_Mark     3
Tst._Mark_EC_Mark       3
Raw_CD_Shroud_Re.       3
Sold._Splatter          5
Comp._Improper_1        6
Sold._Opens             7
Sold._Cold_Joint       20
Sold._Insufficient     40
Data from Montgomery
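The ranking behind a Pareto chart amounts to a sort plus a running total. The sketch below uses the defect counts listed above; the function name is hypothetical.

```python
def pareto_table(counts):
    """Sort categories by count (descending) and add cumulative percentages."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for name, count in ranked:
        running += count
        rows.append((name, count, 100.0 * running / total))
    return rows

defects = {
    "Stamping_Oper_ID": 1, "Stamping_Missing": 1, "Sold._Short": 1,
    "Wire_Incorrect": 1, "Raw_Cd_Damaged": 1, "Comp._Extra_Part": 2,
    "Comp._Missing": 2, "Comp._Damaged": 2, "TST_Mark_White_Mark": 3,
    "Tst._Mark_EC_Mark": 3, "Raw_CD_Shroud_Re.": 3, "Sold._Splatter": 5,
    "Comp._Improper_1": 6, "Sold._Opens": 7, "Sold._Cold_Joint": 20,
    "Sold._Insufficient": 40,
}
for name, count, cumulative in pareto_table(defects):
    print(f"{name:22s} {count:3d} {cumulative:6.1f}%")
```

The first two rows show that insufficient solder and cold joints together account for 60 of the 98 defects, which is the kind of conclusion the chart is meant to make obvious.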
Pareto Chart
• for circuit defect data
[Figure: Pareto Chart & Analysis; NO_DEFCT; bars in descending order of defect count (40, 20, 7, 6, 5, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1), from Sold._Insufficient to Stamping_Missing; count axis from 0 to 100, cumulative percentage axis from 0% to 100%]
Fishbone Diagrams
• organize causes in analysis
» have spine, with cause types branching from spine, and sub-groups branching further
Example - factors influencing poor conversion in reactive extrusion
[Figure: fishbone diagram with “poor conversion” at the head of the spine; branch labels: initiator type, half-life, barrel temperature, temperature distribution along barrel, temperature control, polymer grade, catalyst used - metallocene/Ziegler-Natta]
Graphical Methods for Analyzing Data
Looking for time trends in data...
• Time sequence plot – look for
» jumps, ramps to new values - indicate shift in mean operation
» meandering - indicates time correlation in data
» large amount of variation about general trend - indication of large variance
Time Sequence Plot - Naphtha 90% Point
[Figure: time sequence plot; vertical axis: 90% point (degrees F), 390 to 480; horizontal axis: 0 to 270]
• time sequence plot for naphtha 90% point - indicates amount of heavy hydrocarbons present in gasoline range material
Annotations on the plot:
» excursion - sudden shift in operation
» meandering about average operating point - time correlation in data
Graphical Methods for Analyzing Data
Monitoring process operation
• Quality Control Charts
– time sequence plots with added indications of variation
» account for fluctuations in values associated with natural process noise
» look for significant jumps - shifts - that exceed normal range of variation of values
» if significant shift occurs, stop and look for “assignable causes”
» essentially graphical “hypothesis tests”
» can plot - measurements, sample averages, ranges, standard deviations, ...
Example - Monitoring Process Mean
• is the average process operation constant?
• collect samples at time intervals, compute average, and plot in time sequence plot
• indication of process variation - standard deviation estimated from prior data
» propagates through sample average calculation
» if “s” is sample standard deviation and “n” is the number of observations in each sample, calculated averages will lie within $\pm 3s/\sqrt{n}$ of the historical average about 99.7% of the time if the mean operation has NOT shifted
» values outside this range suggest that a shift in the mean operation has occurred - alarm - “something has happened”
Example - Monitoring Process Mean
• time sequence plot with these alarm limits is referred to as a “Shewhart X-bar Chart”
» X-bar - sample mean of X
[Figure: X-BAR chart; Mean: 74.0012, Proc. sigma: .009785, n: 5; horizontal axis: Samples, 1 to 25; centre line at 74.0012, control limits at 73.9880 and 74.0143]
Annotations on the chart:
• upper and lower control limits
• centre-line or target line - indicates mean when process is operating properly
• no points exceed limits - process is in a state of statistical control
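The control limits on this chart can be reproduced from the values printed in its header. The sketch below assumes the usual three-sigma Shewhart limits, centre ± 3s/√n; the function names and the three sample averages in the usage lines are hypothetical.

```python
import math

def xbar_limits(centre, sigma, n):
    """Shewhart X-bar control limits: centre +/- 3*sigma/sqrt(n)."""
    half_width = 3.0 * sigma / math.sqrt(n)
    return centre - half_width, centre + half_width

def out_of_control(sample_means, centre, sigma, n):
    """Return the 1-based indices of sample averages outside the limits."""
    lcl, ucl = xbar_limits(centre, sigma, n)
    return [i for i, m in enumerate(sample_means, start=1)
            if m < lcl or m > ucl]

# Header values from the X-bar chart: mean, process sigma, sample size.
lcl, ucl = xbar_limits(centre=74.0012, sigma=0.009785, n=5)
print(lcl, ucl)

# Hypothetical sample averages, used only to exercise the alarm check.
print(out_of_control([74.000, 74.020, 73.980], 74.0012, 0.009785, 5))
```

The computed limits agree with the plotted ones (73.9880 and 74.0143) to within the rounding of the displayed mean.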
Example - Monitoring Process Mean
• X-bar chart
[Figure: X-BAR chart; Mean: 74.0022, Proc. sigma: .011832, n: 5; horizontal axis: Samples, 1 to 25; centre line at 74.0022, control limits at 73.9864 and 74.0181]
Point exceeds region of natural variation - significant shift has occurred
Graphical Methods for Analyzing Data
Visualizing relationships between variables
Techniques
• scatterplots
• scatterplot matrices
» also referred to as “casement plots”
Scatterplots
… are also referred to as “x-y diagrams”
• plot values of one variable against another
• look for systematic trend in data
» nature of trend
• linear?
• exponential?
• quadratic?
» degree of scatter - does spread increase/decrease over range?
• indication that variance isn’t constant over range of data
Scatterplots - Example
• tooth discoloration data - discoloration vs. fluoride
[Figure: Scatterplot (teeth 4v*20c); horizontal axis: FLUORIDE, 0.0 to 4.5; vertical axis: DISCOLOR, 5 to 50; annotation: trend - possibly nonlinear?]
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Figure: Scatterplot (teeth 4v*20c); horizontal axis: BRUSHING, 4 to 13; vertical axis: DISCOLOR, 5 to 50; annotation: significant trend? - doesn’t appear to be present]
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Figure: Scatterplot (teeth 4v*20c); horizontal axis: BRUSHING, 4 to 13; vertical axis: DISCOLOR, 5 to 50; annotation: variance appears to decrease as # of brushings increases]
Scatterplot matrices
… are a table of scatterplots for a set of variables
Look for -
» systematic trend between “independent” variable and dependent variables - to be described by estimated model
» systematic trend between supposedly independent variables - indicates that these quantities are correlated
• correlation can negatively influence model estimation results
• not independent information
• scatterplot matrices can be generated automatically with statistical software, manually using Excel
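A numerical companion to the scatterplot matrix is a table of pairwise correlation coefficients. The slides do not name a particular measure, so the Pearson coefficient is used below as an assumption. The column values are made up for illustration and are NOT the course's tooth data; the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlation for every pair of named columns, keyed by (name, name)."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b])
            for a in names for b in names}

# Hypothetical values, for illustration only.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],   # nearly proportional to x1
    "y":  [3.0, 1.0, 4.0, 1.0, 5.0],
}
for pair, r in correlation_matrix(columns).items():
    print(pair, round(r, 3))
```

A coefficient near +1 or -1 between two supposedly independent variables is the numerical counterpart of the systematic trend described above.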
Scatterplot Matrices - tooth data
[Figure: Matrix Plot (teeth 4v*20c); scatterplot matrix of FLUORIDE, AGE, BRUSHING and DISCOLOR]
Describing Data Quantitatively
Approach - describe the pattern of variability using a few parameters
» efficient means of summarizing
Techniques
• average - (sample “mean”)
• sample standard deviation and variance
• median
• quartiles
• interquartile range
• ...
Sample Mean - “Average”
Given “n” observations $x_i$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Notes -
» sensitive to extreme data values - outliers - value can be artificially raised or lowered
Sample Variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
• sum of squared deviations about the average
» squaring - notion of distance (squared)
» average - is the centre of gravity
• sample variance provides a measure of dispersion - spread - about the centre of gravity
Note that we divide by “n-1”, and NOT “n” - degrees of freedom argument
Note - there is an alternative form of this equation which is more convenient for computation.
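The sketch below computes the sample variance two ways for the solder thickness data. The slide does not write out the alternative form; the second function uses the standard shortcut (sum of squares minus n times the squared average, divided by n-1), on the assumption that this is the form meant. The function names are hypothetical.

```python
import math

def sample_variance(x):
    """Definition: squared deviations about the average, divided by n-1."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

def sample_variance_shortcut(x):
    """Assumed alternative form: (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(x)
    mean = sum(x) / n
    return (sum(xi * xi for xi in x) - n * mean * mean) / (n - 1)

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
s2 = sample_variance(solder)
print(s2, sample_variance_shortcut(solder), math.sqrt(s2))
```

The shortcut needs only running sums, which made it convenient for hand calculation, but it subtracts two nearly equal numbers and can lose precision when the spread is small relative to the average.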
Sample Standard Deviation
… is simply
$$s = \sqrt{s^2}$$
• sample standard deviation provides a more direct link to dispersion
» e.g., for Normal distribution
• about 95% of values lie within 2 standard devn’s of the mean
• about 99.7% of values lie within 3 standard devn’s of the mean
Range
• provides a measure of spread in the data
• defined as
maximum data value - minimum data value
• can be sensitive to extreme data points
• is often monitored in quality control charts to see if process variance is changing
“Order” Statistics
… summarize the progression of observations in the data set
Quartiles » divide the data in quarters
Deciles» divide the data in tenths ...
Quartiles
• order data - N data points {$y_i$}, i=1,…N
• if N is odd,
» median is observation $y_{(N+1)/2}$
• if N is even,
» median is $\dfrac{y_{N/2} + y_{N/2+1}}{2}$
• i.e., midpoint between two middle points
Quartiles - Q1 and Q3
• Q1: Compute (N+1)/4 = A.B
$$Q_1 = y_A + 0.B \, (y_{A+1} - y_A)$$
• Q3: Compute 3(N+1)/4 = A.B
$$Q_3 = y_A + 0.B \, (y_{A+1} - y_A)$$
» i.e., interpolate between adjacent points, where A is the integer part and 0.B the fractional part of the computed position
» Note - there are other conventions as well - e.g., for Q1, take bottom half of data set, and take midpoint between middle two points if there are an even number of points...
Quartiles - Example
• solder data set
» observations
» 0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1
» ordered: 0.07, 0.09, 0.09, 0.1, 0.1, 0.1, 0.11, 0.12, 0.13
» 9 points --> median is 5th observation: 0.1
» Q1: (N+1)/4 = 2.5
• Q1 = 0.09 + 0.5*(0.09-0.09) = 0.09
» Q3: 3(N+1)/4 = 7.5
• Q3 = 0.11 + 0.5*(0.12-0.11) = 0.115
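The worked example can be checked with a short sketch of the (N+1)/4 interpolation rule. The function names are hypothetical, and clamping positions that fall outside the data to the first or last observation is an added assumption for very small data sets.

```python
import math

def quantile_by_position(ordered, position):
    """Interpolate at a 1-based fractional position A.B in ordered data."""
    a = math.floor(position)      # integer part A
    frac = position - a           # fractional part 0.B
    if a < 1:
        return ordered[0]         # assumed clamp below the first point
    if a >= len(ordered):
        return ordered[-1]        # assumed clamp above the last point
    return ordered[a - 1] + frac * (ordered[a] - ordered[a - 1])

def quartiles(data):
    """Return (Q1, median, Q3) using positions (N+1)/4, (N+1)/2, 3(N+1)/4."""
    y = sorted(data)
    n = len(y)
    return (quantile_by_position(y, (n + 1) / 4),
            quantile_by_position(y, (n + 1) / 2),
            quantile_by_position(y, 3 * (n + 1) / 4))

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
q1, median, q3 = quartiles(solder)
print(q1, median, q3, q3 - q1)   # quartiles and interquartile range
```

For an even number of points the median position (N+1)/2 ends in .5, so the same interpolation returns the midpoint between the two middle points.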
Robustness
… refers to whether a given descriptive statistic is sensitive to extreme data points
Examples
• sample mean
» is sensitive to extreme points - extreme value pulls average toward the extreme
• sample variance
» sensitive to extreme points - large deviation from the sample mean leads to inflated variance
• median, quartiles
» relatively insensitive to extreme data points
Robustness -Solder Data Example
• replace 0.13 by 0.5 - output from Excel

thickness             With 0.13    With 0.5
Mean                  0.101111     0.142222
Median                0.1          0.1
Mode                  0.1          0.1
Standard Deviation    0.017638     0.134887
Sample Variance       0.000311     0.018194
Range                 0.06         0.43
Minimum               0.07         0.07
Maximum               0.13         0.5
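The comparison can be reproduced with Python's standard `statistics` module instead of Excel; the helper name `summary` is hypothetical.

```python
import statistics

def summary(data):
    """Mean, median, sample standard deviation and range of a data set."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),   # divides by n-1
        "range": max(data) - min(data),
    }

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
contaminated = [0.5 if x == 0.13 else x for x in solder]  # replace 0.13 by 0.5

print(summary(solder))
print(summary(contaminated))
```

Changing a single point moves the mean, standard deviation and range substantially while the median stays at 0.1.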
Robustness
• Other robust statistics
» “m-estimator” - involves iterative filtering out of extreme data values, based on data distribution
» trimmed mean - other bases for eliminating extreme data point effect
» median absolute deviation
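Two of these statistics are simple enough to sketch directly. The trimmed mean below drops a fixed number k of points from each end, which is one common definition and an assumption here since the slide does not specify the trimming rule; the median absolute deviation is returned unscaled. The function names are hypothetical, and no m-estimator is attempted.

```python
import statistics

def trimmed_mean(data, k=1):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(data)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away all of the data")
    return statistics.mean(ordered[k:len(ordered) - k])

def median_absolute_deviation(data):
    """Median of the absolute deviations from the median (unscaled)."""
    centre = statistics.median(data)
    return statistics.median(abs(x - centre) for x in data)

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
contaminated = [0.5 if x == 0.13 else x for x in solder]  # replace 0.13 by 0.5

for name, data in (("with 0.13", solder), ("with 0.5", contaminated)):
    print(name, trimmed_mean(data), median_absolute_deviation(data))
```

Both statistics change very little when 0.13 is replaced by 0.5, in contrast to the sample mean and standard deviation.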