II. Graphical Displays of Data
Like many other things, statistical analysis can suffer from “garbage in, garbage out.”
This often happens because no one bothered to look at the data.
Simple data displays can convey a lot of information.
A. Stem-and-Leaf Displays
Purpose: To provide a basis for evaluating the “shape” of the data without the loss of any information.
1. Basic Stem and Leaf Display
This technique is best illustrated by an example.
Pencil lead is actually a ceramic matrix filled with graphite.
A measure of the quality of many ceramic bodies is the porosity.
Porosity is a measure of the void space in the body.
The following data set represents the result of a porosity test on “good” pencil lead:
12.1 13.5 11.7 12.5 12.5 12.7 13.3 13.8 12.3 12.7 12.5 12.8 12.3 13.5 14.3 13.5 13.2 14.2 13.7 13.9 13.6 13.6 13.3 13.3 13.0 12.6 13.2 14.2 14.1 13.7
Let the numbers to the left of the decimal point be the “stems”.
Let the numbers to the right of the decimal point be the “leaves”.
Stem  Leaves
11:
12:
13:
14:
Representing the value 12.1:
Stem  Leaves
11:
12:   1
13:
14:
Representing the first row of data:
Stem  Leaves
11:   7
12:   1557
13:   5
14:
The entire data set:
Stem  Leaves
11.   7
12.   1557375836
13.   538552796633027
14.   3221
This display is the “raw” stem-and-leaf display.
Usually, we “refine” the stem-and-leaf display.
First, we order the leaves on each stem.
Stem  Leaves
11.   7
12.   1335556778
13.   022333555667789
14.   1223
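The ordered display above can be reproduced with a few lines of Python. This is a minimal sketch, assuming every value has exactly one decimal place; the function name `stem_and_leaf` is made up for this illustration.

```python
from collections import defaultdict

# Porosity data for the "good" pencil lead, as listed above.
porosity = [12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
            12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
            13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7]

def stem_and_leaf(values):
    """Map each stem (integer part) to its ordered leaves (tenths digit).

    Assumes one decimal place per value (an assumption of this sketch).
    """
    stems = defaultdict(list)
    for v in values:
        # round() guards against floating-point noise: 12.1 -> 121 -> (12, 1)
        stem, leaf = divmod(round(v * 10), 10)
        stems[stem].append(leaf)
    return {s: sorted(leaves) for s, leaves in sorted(stems.items())}

display = stem_and_leaf(porosity)
for stem, leaves in display.items():
    print(f"{stem}.  {''.join(map(str, leaves))}")
```

Dropping the `sorted(leaves)` call gives the “raw” display instead of the ordered one.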
Next, we add the depth information.
The depth represents how far from the closest end of the data set a particular point is.
For example, the data value 11.7 is the smallest observation; thus, it has a depth of 1.
What is the depth of the data value 14.3?
What is the depth of the data value 12.6?
The completed stem-and-leaf gives the depth of the last value on the stems for the top part of the display.
It gives the depth for the first value of the stems for the bottom part of the display.
We do not give the depth for the stem which contains the middle value of the data set.
In this case, the depth information would be ambiguous.
An aid for finding the depth is to report the number of leaves on each stem.
Until we reach the middle stem, the depth for any stem is just the depth reported from the previous stem plus the number of leaves on the stem.
Stem  Leaves            No.  Depth
11.   7                  1     1
12.   1335556778        10    11
13.   022333555667789   15
14.   1223               4     4
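The depth rules above can be checked with a short Python sketch. The function names `depth_of` and `stem_depths` are made up for this illustration, and reporting the smallest depth for tied values is an assumption, not something the notes specify.

```python
porosity = [12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
            12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
            13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7]

def depth_of(value, values):
    """Depth of a value: how far it is from the closest end of the sorted data.

    For tied values the smallest depth is reported (an assumption).
    """
    ordered = sorted(values)
    n = len(ordered)
    return min(min(i, n + 1 - i)
               for i, y in enumerate(ordered, start=1) if y == value)

def stem_depths(counts):
    """Depth column of the display, given the number of leaves on each stem.

    Stems above the middle report the depth of their last value, stems
    below it the depth of their first value; the middle stem gets None.
    """
    n = sum(counts)
    middle = (n + 1) / 2          # location of the middle of the data
    depths, cumulative = [], 0
    for c in counts:
        first, last = cumulative + 1, cumulative + c   # ranks on this stem
        cumulative = last
        if last < middle:
            depths.append(last)
        elif first > middle:
            depths.append(n + 1 - first)
        else:
            depths.append(None)
    return depths

print(depth_of(14.3, porosity))      # 1
print(depth_of(12.6, porosity))      # 8
print(stem_depths([1, 10, 15, 4]))   # [1, 11, None, 4]
```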
2. Stretched Stem-and-Leaf
Consider a “stretched” stem and leaf display.
Basically it splits each simple stem into two.
Let X* hold the leaves 0 – 4 of stem X (the values X.0 – X.4).
Let X• hold the leaves 5 – 9 of stem X (the values X.5 – X.9).
Stem  Leaves      No.  Depth
11*
11•   7            1     1
12*   133          3     4
12•   5556778      7    11
13*   022333       6
13•   555667789    9    13
14*   1223         4     4
14•
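The splitting rule for the stretched display can be sketched in Python as follows. The function name `stretched_stem_and_leaf` is made up for this illustration, values are assumed to have one decimal place, and half-stems with no leaves (11* and 14• here) are simply left out.

```python
porosity = [12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
            12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
            13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7]

def stretched_stem_and_leaf(values):
    """Split each stem X into X* (leaves 0-4) and X• (leaves 5-9)."""
    rows = {}
    for v in sorted(values):      # sorting keeps stems and leaves in order
        stem, leaf = divmod(round(v * 10), 10)
        label = f"{stem}{'*' if leaf <= 4 else '•'}"
        rows.setdefault(label, []).append(leaf)
    return rows

rows = stretched_stem_and_leaf(porosity)
for label, leaves in rows.items():
    print(f"{label}  {''.join(map(str, leaves)):<10} {len(leaves)}")
```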
Two other extensions of the basic stem-and-leaf display:
• the squeezed stem-and-leaf display
• side-by-side or back-to-back stem-and-leaf display
3. Reading a Data Display
Goal of a data display: let the data speak to you.
Like any conversation, some points are obvious, others come only from questioning the data.
Some obvious questions:
• What is the “center” of the data?
• What is the “spread” of the data?
More subtle questions:
• Do the data follow some pattern?
• Is the pattern symmetric?
• If the pattern is not symmetric, is it right or left tailed?
A right tailed or right skewed pattern:
A left tailed or left skewed pattern:
B. Box Plots
Purpose: To give a quick display of some important features of the data.
Note: The box plot represents a distillation of the data.
The stem-and-leaf display only loses the time order of the data.
The box plot loses some of the information in the data.
However, under several very reasonable assumptions, the information lost is of little or no value.
1. Preliminaries
The box plot is based upon:
• the median
• the quartiles
To find these quantities, we first must order the data set.
Let $y_1, y_2, \cdots, y_n$ denote our data set.
Rearrange the data in ascending order, and let the new data set be denoted by
$y_{(1)}, y_{(2)}, \ldots, y_{(n)}$
where
$y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$.
Note: the stem-and-leaf display with ordered leaves is such an ordered data set.
a. The median
The median, $\tilde{y}$, is the middle value of the ordered data set and is a measure of the “center”.
Literally, the median splits the data set into two equal parts.
Let $m$ denote the “location” of the median in the ordered data set:
$m = \frac{n+1}{2}$
If n is odd, then $m$ is an integer; thus,
$\tilde{y} = y_{(m)}$
If n is even, then $m$ contains the fraction 1/2.
In such a case, the median is the average of the two values “closest” to the “center”:
$\tilde{y} = \frac{y_{(m-1/2)} + y_{(m+1/2)}}{2}$
First Example: The following five values represent the ash content of pencil lead.
y1 = 42.5    y(1) = 40.3
y2 = 40.3    y(2) = 42.5
y3 = 43.4    y(3) = 42.7
y4 = 43.0    y(4) = 43.0
y5 = 42.7    y(5) = 43.4
$m = \frac{n+1}{2} = \frac{5+1}{2} = 3$
$\tilde{y} = y_{(3)} = 42.7$
Second example: the porosities of good pencil lead
Note: the stem-and-leaf display is an ordered data set.
Stem  Leaves      No.  Depth
11*
11•   7            1     1
12*   133          3     4
12•   5556778      7    11
13*   022333       6
13•   555667789    9    13
14*   1223         4     4
14•
$m = \frac{n+1}{2} = \frac{30+1}{2} = 15.5$
$\tilde{y} = \frac{y_{(15)} + y_{(16)}}{2} = \frac{13.3 + 13.3}{2} = 13.3$
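The location rule for the median can be sketched in Python and checked against both examples. The function name `median_by_location` is made up for this illustration.

```python
def median_by_location(values):
    """Median via the location m = (n + 1) / 2 in the ordered data set."""
    y = sorted(values)
    n = len(y)
    m = (n + 1) / 2
    if n % 2 == 1:
        # m is an integer: the median is y_(m) (Python lists are 0-based)
        return y[int(m) - 1]
    # m ends in 1/2: average y_(m - 1/2) and y_(m + 1/2)
    below = int(m - 0.5)
    return (y[below - 1] + y[below]) / 2

ash = [42.5, 40.3, 43.4, 43.0, 42.7]
porosity = [12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
            12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
            13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7]

print(median_by_location(ash))        # 42.7
print(median_by_location(porosity))   # 13.3
```

Python's standard `statistics.median` follows the same rule and should agree on these data.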
b. The upper and lower quartiles
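While the median divides the data into two parts of equal size, the quartiles (Q1, Q3) divide the data into four parts.
Note: the second quartile (Q2) is the median.
Let $q$ denote the location of the first and third quartiles, counted from the bottom and from the top of the ordered data set, respectively:
$q = \begin{cases} \frac{n+3}{4} & \text{if } n \text{ is odd} \\ \frac{n+2}{4} & \text{if } n \text{ is even} \end{cases}$
If $q$ is an integer, then
$Q_1 = y_{(q)}$
$Q_3 = y_{(n+1-q)}$
If $q$ is not an integer, then the quartile is the average of the two values closest to it:
$Q_1 = \frac{y_{(q-1/2)} + y_{(q+1/2)}}{2}$
$Q_3 = \frac{y_{(n+1-q-1/2)} + y_{(n+1-q+1/2)}}{2}$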
Second example: breaking strength of yarn
y1 = 22.7    y(1) = 19.2
y2 = 25.7    y(2) = 20.7
y3 = 20.7    y(3) = 21.2
y4 = 26.7    y(4) = 22.7
y5 = 21.2    y(5) = 22.7
y6 = 19.2    y(6) = 25.7
y7 = 22.7    y(7) = 26.7
$q = \frac{n+3}{4} = \frac{7+3}{4} = 2.5$
$Q_1 = \frac{y_{(2)} + y_{(3)}}{2} = \frac{20.7 + 21.2}{2} = 20.95$
$Q_3 = \frac{y_{(5)} + y_{(6)}}{2} = \frac{22.7 + 25.7}{2} = 24.2$
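The quartile rule can be sketched the same way. The function name `quartiles` is made up for this illustration. Note that library routines for percentiles often use other interpolation conventions and may give slightly different values.

```python
def quartiles(values):
    """Q1 and Q3 via the location q = (n + 3)/4 (n odd) or (n + 2)/4 (n even)."""
    y = sorted(values)
    n = len(y)
    q = (n + 3) / 4 if n % 2 == 1 else (n + 2) / 4

    def value_at(location):
        # location is either an integer or ends in 1/2 (1-based position)
        if location == int(location):
            return y[int(location) - 1]
        below = int(location - 0.5)
        return (y[below - 1] + y[below]) / 2

    # Q1 sits q places from the bottom, Q3 sits q places from the top.
    return value_at(q), value_at(n + 1 - q)

yarn = [22.7, 25.7, 20.7, 26.7, 21.2, 19.2, 22.7]
porosity = [12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
            12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
            13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7]

q1, q3 = quartiles(yarn)
print(round(q1, 2), round(q3, 2))   # 20.95 24.2
print(quartiles(porosity))          # (12.6, 13.7)
```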
2. The Box Plot Itself
We shall illustrate this technique through the porosity data for the “good” pencil lead.
1. Construct a horizontal scale, marked conveniently, which covers at least the range of the data.
|______|______|______|______|______|
10     11     12     13     14     15
2. Find $\tilde{y}$, Q1, Q3.
$\tilde{y} = 13.3$
Q1 = 12.6
Q3 = 13.7
Use Q1 and Q3 to make a rectangular box above the scale.
Draw a vertical line across the box for the median.
[Figure: box from 12.6 to 13.7 with the median line at 13.3, above a scale from 10 to 15.]
3. Determine the “Step”
The Interquartile Range is a measure of variability or spread defined by
Q3 - Q1
We define the step size by
Step = (1.5)(Q3 – Q1)
For the good pencil lead data,
Q3 - Q1 = 13.7 – 12.6 = 1.1
Step = 1.5(1.1) = 1.65
4. Determine the “inner fences”
The fences help us isolate possible outliers.
The inner fences define the bounds for the unquestionably good data.
The Upper Inner fence (UIF) is
UIF = Q3 + Step
The Lower Inner Fence (LIF) is
LIF = Q1 – Step
For the good pencil lead data
UIF = 13.7 + 1.65 = 15.35
LIF = 12.6 – 1.65 = 10.95
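The step and inner-fence arithmetic above can be sketched in Python. The function name `inner_fences` is made up for this illustration; the quartiles are taken from the worked example.

```python
def inner_fences(q1, q3):
    """Return (LIF, UIF) using Step = 1.5 * (Q3 - Q1)."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

# Quartiles of the good pencil lead data from the example above.
lif, uif = inner_fences(12.6, 13.7)
print(round(lif, 2), round(uif, 2))   # 10.95 15.35
```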
5. Locate the most extreme data points which are on or within the inner fences.
These data values are called the adjacents.
Draw vertical lines at these points, and connect these points to the “box” with a horizontal line.
This line is called a whisker.
For the good pencil lead data, all of the values fall within the inner fences.
Thus, the adjacents are: 11.7 and 14.3
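[Figure: box plot of the good pencil lead data with whiskers extending to the adjacents 11.7 and 14.3, above a scale from 10 to 15.]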
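Finding the adjacents can be sketched in Python as follows. The function name `adjacents` is made up for this illustration; the quartiles 12.6 and 13.7 come from the worked example.

```python
porosity = [12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
            12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
            13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7]

def adjacents(values, q1, q3):
    """Most extreme data values that lie on or within the inner fences."""
    step = 1.5 * (q3 - q1)
    lif, uif = q1 - step, q3 + step
    inside = [v for v in values if lif <= v <= uif]
    return min(inside), max(inside)

print(adjacents(porosity, 12.6, 13.7))   # (11.7, 14.3)
```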
6. Calculate the “outer fences”
The outer fences allow us to discriminate between “mild” and “extreme” outliers.
Data values between the inner and outer fences are considered mild.
Data values beyond the outer fences are considered extreme.
The Upper Outer Fence (UOF) is
UOF = Q3 + 2(Step)
The Lower Outer Fence (LOF) is
LOF = Q1 - 2(Step)
For the good pencil lead data
UOF = 13.7 + 2(1.65) = 17.0
LOF = 12.6 - 2(1.65) = 9.3
7. Mark possible “outliers”
We use a ◦ to denote the mild outliers.
We use a • to denote the extreme outliers.
Note: No outliers occur in our example.
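The fence rules can be combined into a small classifier. This is a sketch: the function name `classify` is made up, treating a value that sits exactly on an outer fence as mild is an assumption, and the test values 16.0 and 9.0 are hypothetical, not part of the data.

```python
porosity = [12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
            12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
            13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7]

def classify(value, q1, q3):
    """Label a value as 'inside', 'mild' outlier, or 'extreme' outlier."""
    step = 1.5 * (q3 - q1)
    if q1 - step <= value <= q3 + step:            # within the inner fences
        return "inside"
    if q1 - 2 * step <= value <= q3 + 2 * step:    # within the outer fences
        return "mild"
    return "extreme"

# No outliers occur in the good pencil lead data.
print({classify(v, 12.6, 13.7) for v in porosity})   # {'inside'}

# Hypothetical values, only to show the other two labels.
print(classify(16.0, 12.6, 13.7))   # mild
print(classify(9.0, 12.6, 13.7))    # extreme
```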
Parallel Box Plots allow us to compare two or more sets of data.
The key: all of the plots must use a common scale.
Place box plots above each other or side-by-side.
____________o |-------|_____|______|--------| o o
____________ |-------|_____|______|--------| o
____________ |-------|_____|______|--------|
|---------------------------------------------------| scale
Box Plots can also be used to analyze designed experiments. When there are categorical factors, the design can be “unstripped” and analyzed using parallel box plots.
Example: Consider an experiment to study the influence of operating temperature and glass type on light output.
                 Temp.
Glass Type    Low    High
A             550    1380
              565    1365
              540    1384
              575    1374
              584    1379
B             545     880
              582     891
              576     864
              553     875
              574     883
The resulting box plot is given below.
The box plots show that a higher temperature yields higher light output. Also, at the low temperature, glass type does not affect light output, but at the high temperature, glass type A produces higher light output.
[Figure: parallel box plots of Light Output by Temp. (Low, High) and Glass type (A, B).]
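The comparison read from the parallel box plots can be checked numerically through the group medians. This is a sketch: the grouping of the 20 observations by glass type and temperature is our reading of the data table above, and the dictionary name `light_output` is made up for this illustration.

```python
from statistics import median

# Light output by (glass type, temperature), as read from the table above.
light_output = {
    ("A", "Low"):  [550, 565, 540, 575, 584],
    ("A", "High"): [1380, 1365, 1384, 1374, 1379],
    ("B", "Low"):  [545, 582, 576, 553, 574],
    ("B", "High"): [880, 891, 864, 875, 883],
}

medians = {group: median(values) for group, values in light_output.items()}
for (glass, temp), center in medians.items():
    print(f"Glass {glass}, {temp:<4} temperature: median = {center}")
```

The medians at the low temperature are close for the two glass types (565 and 574), while at the high temperature glass A is far higher (1379 against 880), which matches the reading of the plot.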