II. Graphical Displays of Data

Like many other things, statistical analysis can suffer from garbage in, garbage out.

This often happens because no one bothered to look at the data.

Simple data displays can convey a lot of information.

A. Stem-and-Leaf Displays

Purpose: To provide a basis for evaluating the “shape” of the data without the loss of any information.

1. Basic Stem-and-Leaf Display

This technique is best illustrated by an example.

Pencil lead is actually a ceramic matrix filled with graphite.

A measure of the quality of many ceramic bodies is the porosity.

Porosity is a measure of the void space in the body.

The following data set represents the result of a porosity test on “good” pencil lead:

12.1 13.5 11.7 12.5 12.5 12.7 13.3 13.8 12.3 12.7 12.5 12.8 12.3 13.5 14.3 13.5 13.2 14.2 13.7 13.9 13.6 13.6 13.3 13.3 13.0 12.6 13.2 14.2 14.1 13.7

Let the numbers to the left of the decimal point be the “stems”.

Let the numbers to the right of the decimal point be the “leaves”.

Stem  Leaves
11:
12:
13:
14:

Representing the value 12.1:

Stem  Leaves
11:
12:   1
13:
14:

Representing the first row of data:

Stem  Leaves
11:   7
12:   1557
13:   5
14:

The entire data set:

Stem  Leaves
11.   7
12.   1557375836
13.   538552796633027
14.   3221

This display is the “raw” stem-and-leaf display.

Usually, we “refine” the stem-and-leaf display.

First, we order the leaves on each stem.

Stem  Leaves
11.   7
12.   1335556778
13.   022333555667789
14.   1223
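The raw and ordered displays can be reproduced with a short Python sketch. It assumes the data are recorded to one decimal place; the names `POROSITY` and `stem_and_leaf` are illustrative and do not come from the notes.

```python
from collections import defaultdict

# Porosity data for the "good" pencil lead, in the order listed above.
POROSITY = [
    12.1, 13.5, 11.7, 12.5, 12.5, 12.7, 13.3, 13.8, 12.3, 12.7,
    12.5, 12.8, 12.3, 13.5, 14.3, 13.5, 13.2, 14.2, 13.7, 13.9,
    13.6, 13.6, 13.3, 13.3, 13.0, 12.6, 13.2, 14.2, 14.1, 13.7,
]


def stem_and_leaf(values, ordered=True):
    """Return {stem: leaves as a string} for data with one decimal place."""
    leaves = defaultdict(list)
    for value in values:
        tenths = round(value * 10)        # 12.1 -> 121, avoids float truncation
        stem, leaf = divmod(tenths, 10)   # 121 -> stem 12, leaf 1
        leaves[stem].append(leaf)
    display = {}
    for stem in sorted(leaves):
        row = sorted(leaves[stem]) if ordered else leaves[stem]
        display[stem] = "".join(str(leaf) for leaf in row)
    return display
```

Calling `stem_and_leaf(POROSITY, ordered=False)` gives the raw display, and the default call gives the display with ordered leaves. Stems with no observations are simply omitted in this sketch.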

Next, we add the depth information.

The depth represents how far from the closest end of the data set a particular point is.

For example, the data value 11.7 is the smallest observation; thus, it has a depth of 1.

What is the depth of the data value 14.3?

What is the depth of the data value 12.6?

The completed stem-and-leaf gives the depth of the last value on the stems for the top part of the display.

It gives the depth for the first value of the stems for the bottom part of the display.

We do not give the depth for the stem which contains the middle value of the data set.

In this case, the depth information would be ambiguous.

An aid for finding the depth is to report the number of leaves on each stem.

Until we reach the middle stem, the depth for any stem is just the depth reported from the previous stem plus the number of leaves on the stem.

Stem  Leaves            No.  Depth
11.   7                   1      1
12.   1335556778         10     11
13.   022333555667789    15
14.   1223                4      4
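The depth column can be computed from the leaf counts alone. The sketch below is one way to do it under the rules stated above; the function names are illustrative, and leaving the depth of an empty stem blank is an assumption made here.

```python
def value_depth(rank, n):
    """Depth of the observation at 1-based rank among n ordered values."""
    return min(rank, n + 1 - rank)


def stem_depths(counts):
    """Depth column for a display, given the leaf counts from top to bottom.

    None marks a blank entry: the stem that holds the middle of the data,
    or (an assumption of this sketch) a stem with no leaves.
    """
    n = sum(counts)
    depths = []
    cumulative = 0
    for count in counts:
        below = cumulative              # observations on earlier stems
        cumulative += count
        if count == 0:
            depths.append(None)         # empty stem: nothing to report
        elif cumulative <= n / 2:
            depths.append(cumulative)   # top part: depth of the stem's last value
        elif below >= n / 2:
            depths.append(n - below)    # bottom part: depth of the stem's first value
        else:
            depths.append(None)         # middle stem: depth would be ambiguous
    return depths
```

For the counts 1, 10, 15, 4 this returns the depths 1, 11, blank, 4, matching the display.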

2. Stretched Stem-and-Leaf

Consider a “stretched” stem-and-leaf display.

Basically, it splits each simple stem into two.

Let X* be X0 – X4; that is, stem X* holds the leaves 0 through 4.

Let X• be X5 – X9; that is, stem X• holds the leaves 5 through 9.

Stem  Leaves      No.  Depth
11*
11•   7             1      1
12*   133           3      4
12•   5556778       7     11
13*   022333        6
13•   555667789     9     13
14*   1223          4      4
14•
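Splitting each stem in two is mechanical once the basic display with ordered leaves is available. A minimal sketch, assuming the display is stored as a mapping from stem to a string of leaves (the names `BASIC` and `stretch` are illustrative):

```python
# Basic display with ordered leaves for the porosity data: stem -> leaves.
BASIC = {11: "7", 12: "1335556778", 13: "022333555667789", 14: "1223"}


def stretch(display):
    """Split each stem into a '*' row (leaves 0-4) and a '•' row (leaves 5-9)."""
    stretched = {}
    for stem, leaves in display.items():
        # Leaves are single digit characters, so string comparison is safe here.
        stretched[f"{stem}*"] = "".join(d for d in leaves if d <= "4")
        stretched[f"{stem}•"] = "".join(d for d in leaves if d >= "5")
    return stretched
```

`stretch(BASIC)` yields the eight rows 11* through 14•, with empty strings for the rows that have no leaves.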

Two other extensions of the basic stem-and-leaf display:

• the squeezed stem-and-leaf display

• side-by-side or back-to-back stem-and-leaf display

3. Reading a Data Display

Goal of a data display: let the data speak to you.

Like any conversation, some points are obvious, others come only from questioning the data.

Some obvious questions:

• What is the “center” of the data?

• What is the “spread” of the data?

More subtle questions:

• Do the data follow some pattern?

• Is the pattern symmetric?

• If the pattern is not symmetric, is it right or left tailed?

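A right-tailed or right-skewed pattern has its long tail stretching toward the larger values.

A left-tailed or left-skewed pattern has its long tail stretching toward the smaller values.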

• Are there multiple peaks?

What do multiple peaks suggest?

• Are there outliers?

B. Box Plots

Purpose: To give a quick display of some important features of the data.

Note: The box plot represents a distillation of the data.

The stem-and-leaf display only loses the time order of the data.

The box plot loses some of the information in the data.

However, under several very reasonable assumptions, the information lost is of little or no value.

1. Preliminaries

The box plot is based upon:

• the median

• the quartiles

To find these quantities, we first must order the data set.

Let $y_1, y_2, \cdots, y_n$ denote our data set.

Rearrange the data in ascending order, and let the new data set be denoted by

$y_{(1)}, y_{(2)}, \cdots, y_{(n)}$

where

$y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$

Note: the stem-and-leaf with ordered leaves is such an ordered data set.

a. The median

The median, $\tilde{y}$, is the middle value of the ordered data set and is a measure of the “center”.

Literally, the median splits the data set into two equal parts.

Let $m$ denote the “location” of the median in the ordered data set.

$m = \frac{n+1}{2}$

If n is odd, then $m$ is an integer; thus,

$\tilde{y} = y_{(m)}$

If n is even, then $m$ contains the fraction 1/2.

In such a case, the median is the average of the two values “closest” to the “center”.

$\tilde{y} = \frac{y_{(m-1/2)} + y_{(m+1/2)}}{2}$

First Example: The following five values represent the ash content of pencil lead.

Original     Ordered
y1 = 42.5    y(1) = 40.3
y2 = 40.3    y(2) = 42.5
y3 = 43.4    y(3) = 42.7
y4 = 43.0    y(4) = 43.0
y5 = 42.7    y(5) = 43.4

$m = \frac{n+1}{2} = \frac{5+1}{2} = 3$

$\tilde{y} = y_{(3)} = 42.7$

Second example: the porosities of good pencil lead

Note: the stem-and-leaf is an ordered data set.

Stem  Leaves      No.  Depth
11*
11•   7             1      1
12*   133           3      4
12•   5556778       7     11
13*   022333        6
13•   555667789     9     13
14*   1223          4      4
14•

$m = \frac{n+1}{2} = \frac{30+1}{2} = 15.5$

$\tilde{y} = \frac{y_{(15)} + y_{(16)}}{2} = \frac{13.3 + 13.3}{2} = 13.3$
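The location rule for the median translates directly into code. The following Python sketch follows the rule $m = (n+1)/2$; the names `median_by_location` and `ASH` are illustrative, and the only adjustment is the shift from the 1-based positions used in these notes to Python's 0-based indexing.

```python
def median_by_location(values):
    """Median found through its location m = (n + 1) / 2 in the ordered data."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    m = (n + 1) / 2
    if m.is_integer():                    # n odd: a single middle value
        return ordered[int(m) - 1]        # y_(m), shifted to 0-based indexing
    lower = ordered[int(m - 0.5) - 1]     # y_(m - 1/2)
    upper = ordered[int(m + 0.5) - 1]     # y_(m + 1/2)
    return (lower + upper) / 2


# Ash content of pencil lead (first example above).
ASH = [42.5, 40.3, 43.4, 43.0, 42.7]
print(median_by_location(ASH))  # 42.7
```

For everyday work the standard library function `statistics.median` gives the same result; the sketch only makes the location rule explicit.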

b. The upper and lower quartiles

While the median divides the data into two parts of equal numbers, the quartiles (Q1, Q3) divide the data into four parts.

Note: the second quartile (Q2) is the median.

Let $q$ be the location of the first and third quartiles.

$q = \frac{n+2}{4}$ if $n$ is even

$q = \frac{n+3}{4}$ if $n$ is odd

If $q$ is an integer, then

$Q_1 = y_{(q)}$

$Q_3 = y_{(n+1-q)}$

If $q$ is not an integer, then the quartile is the average of the two values closest to it.

$Q_1 = \frac{y_{(q-1/2)} + y_{(q+1/2)}}{2}$

$Q_3 = \frac{y_{(n+1-q-1/2)} + y_{(n+1-q+1/2)}}{2}$

First example: The “good” pencil lead data

$q = \frac{n+2}{4} = \frac{30+2}{4} = \frac{32}{4} = 8$

$Q_1 = y_{(8)} = 12.6$

$Q_3 = y_{(31-8)} = y_{(23)} = 13.7$

Second example: breaking strength of yarn

Original     Ordered
y1 = 22.7    y(1) = 19.2
y2 = 25.7    y(2) = 20.7
y3 = 20.7    y(3) = 21.2
y4 = 26.7    y(4) = 22.7
y5 = 21.2    y(5) = 22.7
y6 = 19.2    y(6) = 25.7
y7 = 22.7    y(7) = 26.7

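$q = \frac{n+3}{4} = \frac{7+3}{4} = 2.5$

$Q_1 = \frac{y_{(2)} + y_{(3)}}{2} = \frac{20.7 + 21.2}{2} = 20.95$

$Q_3 = \frac{y_{(5)} + y_{(6)}}{2} = \frac{22.7 + 25.7}{2} = 24.2$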
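The quartile rule used in these notes differs from the defaults of many software packages, so a short sketch may help when checking hand calculations. The function below follows the location rule given above; the names `quartiles_by_location` and `YARN` are illustrative.

```python
def quartiles_by_location(values):
    """Return (Q1, Q3) using the location q = (n+2)/4 (n even) or (n+3)/4 (n odd)."""
    ordered = sorted(values)
    n = len(ordered)
    q = (n + 2) / 4 if n % 2 == 0 else (n + 3) / 4

    def at(location):
        """Value at a 1-based location; a location ending in .5 averages its neighbours."""
        if location.is_integer():
            return ordered[int(location) - 1]
        lower = ordered[int(location - 0.5) - 1]
        upper = ordered[int(location + 0.5) - 1]
        return (lower + upper) / 2

    # Q1 sits q places from the bottom, Q3 sits q places from the top.
    return at(q), at(n + 1 - q)


# Breaking strength of yarn (second example above).
YARN = [22.7, 25.7, 20.7, 26.7, 21.2, 19.2, 22.7]
q1, q3 = quartiles_by_location(YARN)
print(round(q1, 2), round(q3, 2))
```

The results are rounded for display because sums such as 20.7 + 21.2 are subject to ordinary floating-point rounding. Library routines such as `numpy.percentile` interpolate differently by default and may return slightly different quartiles.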

2. The Box Plot Itself

We shall illustrate this technique through the porosity data for the “good” pencil lead.

1. Construct a horizontal scale, marked conveniently, which covers at least the range of the data.

|______|______|______|______|______|
10     11     12     13     14     15

2. Find $\tilde{y}$, Q1, Q3

$\tilde{y} = 13.3$

Q1 = 12.6

Q3 = 13.7

Use Q1 and Q3 to make a rectangular box above the scale.

Draw a vertical line across the box for the median.

[Figure: box from Q1 = 12.6 to Q3 = 13.7, with a vertical line at the median 13.3, drawn above the 10 to 15 scale]

3. Determine the “Step”

The Interquartile Range is a measure of variability or spread defined by

Q3 - Q1

We define the step size by

Step = (1.5)(Q3 – Q1)

For the good pencil lead data,

Q3 - Q1 = 13.7 – 12.6 = 1.1

Step = 1.5(1.1) = 1.65

4. Determine the “inner fences”

The fences help us isolate possible outliers.

The inner fences define the bounds for the unquestionably good data.

The Upper Inner fence (UIF) is

UIF = Q3 + Step

The Lower Inner Fence (LIF) is

LIF = Q1 – Step

For the good pencil lead data,

UIF = 13.7 + 1.65 = 15.35

LIF = 12.6 – 1.65 = 10.95
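The step and the inner fences are simple arithmetic on the quartiles. A minimal sketch, with the hypothetical helper name `inner_fences` and with the factor 1.5 from the definition of the step exposed as a parameter:

```python
def inner_fences(q1, q3, factor=1.5):
    """Return (step, lower inner fence, upper inner fence)."""
    step = factor * (q3 - q1)       # step = 1.5 times the interquartile range
    return step, q1 - step, q3 + step


# Quartiles of the good pencil lead data.
step, lif, uif = inner_fences(12.6, 13.7)
print(round(step, 2), round(lif, 2), round(uif, 2))  # 1.65 10.95 15.35
```

The values are rounded before printing because 13.7 – 12.6 is not exactly 1.1 in binary floating point.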

5. Locate the most extreme data points which are on or within the inner fences.

These data values are called the adjacents.

Draw vertical lines at these points, and connect these points to the “box” with a horizontal line.

This line is called a whisker.

For the good pencil lead data, all of the values fall within the inner fences.

Thus, the adjacents are: 11.7 and 14.3

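[Figure: box plot of the porosity data on the 10 to 15 scale, with whiskers running from the box to the adjacents 11.7 and 14.3]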

6. Calculate the “outer fences”

The outer fences allow us to discriminate between “mild” and “extreme” outliers.

Data values between the inner and outer fences are considered mild.

Data values beyond the outer fences are considered extreme.

The Upper Outer Fence (UOF) is

UOF = Q3 + 2(step)

The Lower Outer Fence (LOF) is

LOF = Q1 - 2(step)

For the good pencil lead data,

UOF = 13.7 + 2(1.65) = 17.0

LOF = 12.6 - 2(1.65) = 9.3

7. Mark possible “outliers”

We use a ◦ to denote the mild outliers.

We use a • to denote the extreme outliers.

Note: No outliers occur in our example.
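The fence rules can be collected into one small classifier. This is a sketch rather than a standard routine: the name `classify` is illustrative, and treating a value that falls exactly on an outer fence as mild is an assumption, since the rules above only cover values between or beyond the fences.

```python
def classify(value, q1, q3):
    """Label a value as 'inside', 'mild' or 'extreme' relative to the fences."""
    step = 1.5 * (q3 - q1)
    lif, uif = q1 - step, q3 + step              # inner fences
    lof, uof = q1 - 2 * step, q3 + 2 * step      # outer fences
    if lif <= value <= uif:
        return "inside"     # on or within the inner fences
    if lof <= value <= uof:
        return "mild"       # between the inner and outer fences (assumed inclusive)
    return "extreme"        # beyond the outer fences


# The smallest and largest porosity values are both inside the inner fences.
print(classify(11.7, 12.6, 13.7), classify(14.3, 12.6, 13.7))  # inside inside
```

A hypothetical porosity of 16.0 would be labelled mild, and one of 18.0 extreme, since the upper fences for these data are 15.35 and 17.0.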

Parallel Box Plots allow us to compare two or more sets of data.

The Key: must use a common scale.

Place box plots above each other or side-by-side.

           ____________
o |-------|_____|______|--------| o o

           ____________
  |-------|_____|______|--------| o

           ____________
  |-------|_____|______|--------|

|---------------------------------------------------| scale

Box Plots can also be used to analyze designed experiments. When there are categorical factors, the design can be “unstripped” and analyzed using parallel box plots.

Example: Consider an experiment to study the influence of operating temperature and glass type on light output.

                 Temp.
Glass Type    Low    High
              550    1380
              565    1365
A             540    1384
              575    1374
              584    1379
              545     880
              582     891
B             576     864
              553     875
              574     883

The resulting box plot is given below.

The box plots show that a higher temperature yields higher light output. Also, at the low temperature, glass type does not affect light output, but at the high temperature, glass type A produces higher light output.

[Figure: parallel box plots of Light Output, grouped by Temp. (Low, High) and Glass type (A, B)]
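The comparison read from the parallel box plots can be checked numerically by summarizing each temperature and glass type combination. The sketch below assumes the flattened data table reads as two temperature columns (Low, High) with five runs for each glass type; the names `LIGHT_OUTPUT` and `group_summary` are illustrative.

```python
from statistics import median

# Light output by (temperature, glass type).
# Assumption: the grouping follows the table as reconstructed from the notes.
LIGHT_OUTPUT = {
    ("Low", "A"): [550, 565, 540, 575, 584],
    ("High", "A"): [1380, 1365, 1384, 1374, 1379],
    ("Low", "B"): [545, 582, 576, 553, 574],
    ("High", "B"): [880, 891, 864, 875, 883],
}


def group_summary(groups):
    """Return {group: (min, median, max)} so all box plots can share one scale."""
    return {key: (min(vals), median(vals), max(vals)) for key, vals in groups.items()}


for (temp, glass), (low, mid, high) in group_summary(LIGHT_OUTPUT).items():
    print(f"{temp:>4} {glass}: min={low} median={mid} max={high}")
```

Under that reading, the two low temperature groups have nearly identical medians, while at the high temperature the median for glass type A is far above that for type B, which agrees with the conclusion drawn from the plots.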

Importance of Box Plots:

Box plots allow us to tell at a glance:

1. center

2. spread

3. outliers

Other important data displays:

• histograms

• time plots

We generally use software to generate all data displays.

The instructor should give a class demonstration using the software of his or her choice.
