Page 1: §2 - Yvette Butterworth | Welcomeprofbutterworth.com/foothill/foot_Stats/notes/ch2Brase9th.doc  · Web viewChapter 2: Organizing Data. Section Title Notes Pages. 1 Frequency Distributions,

Chapter 2: Organizing Data

Section  Title
1        Frequency Distributions, Histograms & Related Topics
2        Bar Graphs, Circle Graphs & Time-Series Graphs
3        Stem-and-Leaf Displays

Y. Butterworth Ch. 2 Brase’s Statistics (Foothill) 1


§2.1 Frequency Distributions, Histograms & Related Topics

There are 2 types of statistics: descriptive and inferential. It is the descriptive statistic that we will be discussing in this chapter. A descriptive statistic describes the characteristics of data.

Following is a list of the 5 important characteristics:
1) The center of the data, which is a representative value that indicates where the middle of the data lies. The most common descriptive statistic for measuring center is the average.
2) The variation or scatter of the data is also important.
3) The distribution of the data tells the shape of the data. We are very familiar, and will become even more so, with a distribution called the Normal: this is a bell-shaped curve, the type of distribution that class grades follow.
4) Outliers, which are data points that lie outside the “normal” variation of the data.
5) The change in characteristics of data over time.

Just like memorizing order of operations (Please Excuse My Dear Aunt Sally, PEMDAS) or the order of the planets in our solar system (Matilda Visits Every Monday Just Stays Until Noon Period) or the names of the Great Lakes (HOMES), there is a mnemonic device to help you remember these important characteristics of data:

Computer Viruses Destroy Or Terminate (CVDOT)

It is my belief as a statistics instructor that you cannot understand a statistic until you have learned how to calculate that statistic by hand, so even though we will be using technology to calculate many statistics, I will still require you to be able to find basic descriptive statistics by hand for small data sets.

To help us in our study of the shape of data (distribution) we will be learning how to compile a frequency distribution (table), a relative frequency distribution and a cumulative frequency distribution. A frequency table will be created which will give categories (not like the categories for ordinal data) and the number of experimental units (data points) within each category. From these we will learn how to draw pictures that show us distributions. Let’s go through an example to learn the process. Here is a summary of the process to start:

Creating a Frequency Table (Distribution)
1. Decide upon the # of classes (the categories). There should be between 5 and 20 classes, but if many have 1 or 0 data points then you have chosen too many. You will normally be told how many classes to use.
2. Find the class width (the range of data points in each class; the distance between successive lower class limits, upper class limits or class midpoints/marks). This is
   (Maximum - Minimum) / # of classes
   ** You will need to round up if it is not a whole number.
3. Select a lower limit for the 1st class. This point does not have to be the lowest data point, but it should make sense in terms of the data (no negatives if negatives don’t exist, etc.). To this lower limit add the class width. This is the lower limit of the next class, not the upper limit of the first class! Continue until you’ve gotten all your classes. (You’ll need to go to the lower limit of a class beyond what you need.)
4. Since the last class has an upper limit that you have not found, be sure to fill it in, as well as the other upper limits. (This is done by going one unit below the next lower limit.)
5. Make your counts/tallies as to the number of data points in each class.

Example: The ages of people arrested for purse snatching are:

16, 41, 25, 21, 30, 17, 29, 50, 30, & 39

Using 4 classes, create a frequency table (distribution) for the data.

# of Classes:
Class Width:
Lower Limits:
Place Upper Limits:
Make counts:
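The five steps can also be checked with a short program. The following is only a sketch in Python, not part of the course materials; it assumes the first lower limit is the smallest data point (16), which is one sensible choice among several, and all variable names are made up for illustration.

```python
import math

ages = [16, 41, 25, 21, 30, 17, 29, 50, 30, 39]
num_classes = 4  # Step 1: the number of classes is given in the example

# Step 2: class width = (Maximum - Minimum) / # of classes, rounded up
class_width = math.ceil((max(ages) - min(ages)) / num_classes)

# Step 3: lower limits, starting (by assumption) at the smallest data point
first_lower = min(ages)
lower_limits = [first_lower + i * class_width for i in range(num_classes)]

# Step 4: each upper limit is one unit below the next lower limit
upper_limits = [low + class_width - 1 for low in lower_limits]

# Step 5: count the data points that fall in each class
frequencies = [
    sum(1 for age in ages if low <= age <= high)
    for low, high in zip(lower_limits, upper_limits)
]

for low, high, freq in zip(lower_limits, upper_limits, frequencies):
    print(f"{low}-{high}: {freq}")

# The frequencies must add up to the number of data points
assert sum(frequencies) == len(ages)
```

A different starting lower limit (15, for instance) would give different classes and counts, so treat the printed table as one possible answer rather than the only one.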

Some places where students go wrong, and where you should be cautious, are:
1) Are your classes mutually exclusive?
2) Did you include all classes even if one was zero?
3) Does the sum of frequencies add to the number of data points? This is really important!

Next, we need to discuss the relative-frequency table (distribution) which is simply our frequency distribution listed in terms of percent of the whole. To create a relative frequency distribution you first create a frequency distribution and then divide each class frequency by the total number of data points (sum of all frequencies).

Rel. Freq. = (Class Freq.) / (Sum of Freq.)
*Round appropriately.


Example: For the above purse-snatcher data create the relative-frequency table (distribution).
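As a sketch of this calculation in Python, the block below divides each class frequency by the sum of the frequencies. The class labels and counts are an assumption: they come from using 4 classes of width 9 starting at a lower limit of 16, and a different choice of lower limit would change them.

```python
# Frequency table for the purse-snatcher ages (assumed classes: width 9, starting at 16)
frequencies = {"16-24": 3, "25-33": 4, "34-42": 2, "43-51": 1}

total = sum(frequencies.values())  # sum of all frequencies = number of data points

# Rel. Freq. = Class Freq. / Sum of Freq., rounded to 2 decimal places
relative = {label: round(freq / total, 2) for label, freq in frequencies.items()}

for label, rel in relative.items():
    print(f"{label}: {rel:.2f}")
```

The relative frequencies should add to 1 (or 100%), apart from small rounding differences.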

Finally, we need to know how to create a cumulative frequency table (distribution). This is just a running total that includes all classes below, but not any above. In a cumulative frequency distribution, the cumulative frequency in the last class must be the total number of data points. The classes are typically written in a slightly different manner for a cumulative frequency table, but your book continues on with the same lower and upper class boundary labeling. The correct labeling uses “less than” together with the lower limit of the next class.

Example: For the purse-snatcher data create a cumulative frequency table
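The running total can be sketched in Python as follows. The lower limits, class width and frequencies are assumptions carried over from a 4-class table that starts at 16 with width 9; the “less than” labels use the lower limit of the next class, as described above.

```python
from itertools import accumulate

# Assumed frequency table: 4 classes of width 9 starting at a lower limit of 16
lower_limits = [16, 25, 34, 43]
class_width = 9
frequencies = [3, 4, 2, 1]

# Running total of the class frequencies
cumulative = list(accumulate(frequencies))

# Label each row with "less than" the lower limit of the NEXT class
labels = [f"less than {low + class_width}" for low in lower_limits]

for label, running_total in zip(labels, cumulative):
    print(f"{label}: {running_total}")

# The last cumulative frequency must equal the number of data points
assert cumulative[-1] == sum(frequencies)
```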

What does this all mean? Well, this is just a preview of how the data is behaving. We are learning where the data stacks up (freq table), what percentage lies in certain areas (rel. freq. table) and at what point have we seen most of the data (cum freq table). A frequency table (distribution) is really just the first step in seeing the shape of the data, and we will be using it to draw an actual pictorial representation in the next section. But, at this early stage we can begin by comparing frequency distributions to other similar data or data over time and looking for patterns – and really looking for patterns and finding out how mathematically significant those patterns are is what statistics is all about!

Histograms are pictures (graphs) that show the shape of the data that we are beginning to see take shape from the frequency tables we just learned about. We will also discuss line graphs of frequency (freq. polygon) and cumulative frequency dist. (ogives).

A histogram is constructed from a freq. or rel. freq. table. The horizontal axis consists of the scale of the data, marked either with the midpoint of each class (called class marks or midpoints; found by adding the lower & upper limit and dividing by 2) or with the class boundaries (the points midway between the upper limit of one class and the lower limit of the next; found by subtracting consecutive upper and lower limits and dividing by 2, which usually gives 0.5, then subtracting that amount from each lower limit and adding it to each upper limit). The vertical scale is the frequency. Each class is shown with a vertical bar, the left side of which is the lower class boundary and the right side is the upper class boundary (some people, and Minitab, use the upper and lower class limits to create these bars, but that leaves gaps between bars where data could fall and which could lead the reader to misinterpret the data). There are no gaps between the bars on a histogram! Everything must be clearly marked, meaning that the left side is clearly marked with the lower class boundary and the right with the upper class boundary, and the axes are labeled with their appropriate “units”. It should also be noted that some frequency histograms use the class midpoints/marks to label the bars.

Creating a Histogram
1. Create a frequency or relative frequency table
2. Find the class boundaries (or the class midpoints/marks)
3. Create an axis system with the class boundaries on the horizontal axis & frequencies on the vertical axis (label clearly)
4. Create a bar the width of the class and the height of the frequency to represent each class in the table.

Example: Create a histogram for the purse-snatcher data.
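Steps 2 through 4 can be sketched in Python. The class limits and frequencies below are assumptions (4 classes of width 9 starting at 16), and the bars are printed sideways as rows of `#` characters rather than drawn, so this is only a rough stand-in for a properly labeled histogram.

```python
# Assumed class limits and frequencies for the purse-snatcher ages
classes = [(16, 24), (25, 33), (34, 42), (43, 51)]
frequencies = [3, 4, 2, 1]

# Half the gap between one class's upper limit and the next class's lower limit
half_gap = (classes[1][0] - classes[0][1]) / 2

rows = []
for (low, high), freq in zip(classes, frequencies):
    lower_boundary = low - half_gap   # left edge of the bar
    upper_boundary = high + half_gap  # right edge of the bar
    midpoint = (low + high) / 2       # class mark
    rows.append((lower_boundary, upper_boundary, midpoint, freq))

# One text "bar" per class; the bar length is the class frequency
for lower_boundary, upper_boundary, midpoint, freq in rows:
    bar = "#" * freq
    print(f"{lower_boundary:4.1f} to {upper_boundary:4.1f} (midpoint {midpoint:4.1f}) | {bar}")
```

Because each upper boundary equals the next lower boundary, bars drawn from these boundaries touch, which is why a histogram has no gaps.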

What does this do for us? Answer: It shows you the shape of the data! Let’s continue this discussion in just a moment. We need to discuss the shape of the data in a little more detail before we discuss the shape of this particular data.

The shape tells us where the “majority” of the data is “bunching” up. We will get even more specific about these names once we learn about some of the mathematical descriptors of data. For now they are very general terms that tell us how the data is “bunching”.

For the purse-snatcher data, we can see that as age increases the number of people snatching purses goes down and that the in-between ages are where we have the highest frequency. We see that it is a skewed data set; it is not mound-shaped symmetrical (normally distributed, that bell-shaped curve that we see in grades). What is the skew?

[Figure: sketches of distribution shapes, labeled Mound-Shaped Symmetrical, Left Skewed, Right Skewed, Bimodal, and Uniform (Also Symmetrical)]

If you create a relative frequency histogram you will not see any difference in shape. You are seeing the exact same trends; it is just represented in terms of percentages instead of raw numbers.

When we talk about a frequency polygon we are talking about the points that correspond to the class marks (mid-point of the class) and the frequency of each class. From these points we create a line graph. This line graph can be shown super-imposed on our histogram. The line graph is nice for showing trends, because our eye follows the slope of the line. As a result of seeing the slope we see increases and decreases according to class. Your book does not cover a frequency polygon, but I think that it is worth mentioning because it brings full circle the reasoning behind discussion of midpoints and also elaborates on another type of graph that is frequently seen in reporting data in popular literature. You may want to: Draw the frequency polygon over the histogram above.

An ogive is a line plot for a cumulative freq. These are constructed starting with the class boundaries and the cumulative frequencies. The major use would be to see the number of values below a particular point in the data. It can also help us to see an overall trend in data – where we are getting the most increase and where things level off. With some data this is important (see the example in the book) and with others it is not as useful. You may wish to create the ogive for the purse snatcher data.

Finally, I want to discuss the dot plot, another type of visual representation of data. The dot plot keeps all the original data and sorts it, but it is worse at showing shape than the histogram. The dot plot is good at showing outliers, however.

Use a number line representative of the range of data values and place a dot for each point just above its representative number on the number line to construct a dot plot. For our purse-snatcher data, a dot plot is not too informative! This is what it looks like.
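A dot plot can be roughed out in text with a few lines of Python. This is only a sketch: the number line simply runs from the smallest age to the largest, one character per year, which is a layout assumption and not how the plot is drawn in class.

```python
from collections import Counter

ages = [16, 41, 25, 21, 30, 17, 29, 50, 30, 39]
counts = Counter(ages)           # how many times each age occurs
low, high = min(ages), max(ages)
width = high - low + 1           # one character column per year of age

# Build the rows of dots from the top level down to the number line
lines = []
for level in range(max(counts.values()), 0, -1):
    row = "".join("*" if counts[value] >= level else " " for value in range(low, high + 1))
    lines.append(row.rstrip())

axis = "-" * width
labels = str(low).ljust(width - len(str(high))) + str(high)
print("\n".join(lines + [axis, labels]))
```

With only 10 values and a single repeated age, the dots are spread thinly along the line, which is why the plot says little about shape here.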

As I have hinted in these last couple of paragraphs, we have many ways of representing data and each has its strengths and weaknesses. Here is a summary:

Display          Strength(s)                       Weakness
List             All data shown                    Can’t see distribution/shape
                 If ordered, can see low to high
Frequency Table  Begin to see distribution/shape   Lose original data
Ogive            See trends                        Lose detail of data
Dot Plot         Keeps original data               May not see distribution/shape
                 See outliers


Now, the word outlier has been brought up. What is an outlier? Answer: It is a data point(s) that lies “way outside” the “consistent grouping.” Again, we have some vague phrasing in here, and we will begin to get clarity on these once we have some mathematical descriptors of the data. For now, just as with the shape, we should know vaguely what an outlier is and realize that there is more information to come.

§2.2 Bar Graphs, Circle Graphs & Time-Series Graphs

This section discusses bar graphs, Pareto charts, pie charts or circle graphs and time-series graphs. The bar graphs shown in this section are often called side-by-side bar charts. These show either quantitative or qualitative data counts (like a histogram) for multiple categorizations of individuals; for example, they can show the counts for male vs. female or for different ethnicities. The Pareto chart is a bar graph for qualitative data where the bars are ordered in descending height to help tell the story of what class (category) is most important. A pie chart/circle graph is the same type that you are familiar with from your everyday life and uses the Relative Frequency Table to show which classes (categories) are the largest or most influential, or uses the percentage of data in each class (category). It should be noted that when percentages are used they must total 100% or a pie chart/circle graph is not appropriate, since the circle symbolically represents a whole, 100%. The time-series graphs are useful for seeing trends over time by tracking quantitative data using a line graph. For a summary of the types of graphs, their appropriateness given qualitative and quantitative data, and some general pointers about creating any type of graph, please review p. 54-55 in blue.

Now we can look at an example of each of the above graph types. I’ll take as many examples as I can from your book exercises.

Graph Type              Type of Data   Why
Bar Graph               Qualitative    Freq or Percentage Displayed
                        Quantitative   Measurements by sub-category
Pareto Chart            Qualitative    Frequency in Descending Order
                                       *Nice when all possible categories aren’t given
Pie Chart/Circle Graph  Qualitative    Total dispersed in categories
                        Quantitative   *Percentage of occurrence
                                       *Must sum to 100%
Time Series Graph       Quantitative   *Shows change over time
                                       *Must be consistent time increments

NOTES ON GRAPHS
1) Title every graph
2) Label axes
3) Identify units of measure
4) Be sure to adhere to any asterisk notation

Example: The following is data from Michael Sullivan’s Statistics: Informed Decisions Using Data. The data represents the educational attainment of US adults 25 and older for both 1990 and 2003. The data is in thousands! Because the data represents such large numbers, it is much more realistic to use a relative frequency to represent the categories.

Educational Attainment            1990 (count/rel freq)   2003 (count/rel freq)
< 9th grade                       16,502 (0.1039)         12,276 (0.0663)
9th to 12th w/ no diploma         22,842 (0.1438)         16,323 (0.0881)
High School Diploma               47,643 (0.2999)         59,292 (0.3202)
Some College, No Degree           29,780 (0.1874)         31,762 (0.1715)
AA or AS Degree                    9,792 (0.0616)         15,147 (0.0818)
BA or BS Degree                   20,833 (0.1311)         33,213 (0.1794)
Graduate or Professional Degree   11,478 (_________)      17,169 (__________)
Total                             158,870                 185,182

a) Complete the relative frequencies for Graduate/Professional Degree for both 1990 & 2003.
b) Complete the table below. Notice the labels.
c) See how easy the differences in relative frequency are to note with this type of chart? At a glance it is easy to determine where there were increases and decreases.
d) Visually, what had the most growth? Do the relative frequencies back your claim?
e) Looking at overall trends, how might you summarize what you see visually?
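Part a) is the relative frequency formula applied to the counts in the table. The Python sketch below recomputes every relative frequency from the counts (in thousands) and rounds to 4 decimal places to match the table; the dictionary names are made up for illustration.

```python
# Counts (in thousands) copied from the educational attainment table
counts_1990 = {
    "< 9th grade": 16502,
    "9th to 12th w/ no diploma": 22842,
    "High School Diploma": 47643,
    "Some College, No Degree": 29780,
    "AA or AS Degree": 9792,
    "BA or BS Degree": 20833,
    "Graduate or Professional Degree": 11478,
}
counts_2003 = {
    "< 9th grade": 12276,
    "9th to 12th w/ no diploma": 16323,
    "High School Diploma": 59292,
    "Some College, No Degree": 31762,
    "AA or AS Degree": 15147,
    "BA or BS Degree": 33213,
    "Graduate or Professional Degree": 17169,
}

def relative_frequencies(counts):
    """Divide each class count by the sum of all counts, rounded to 4 places."""
    total = sum(counts.values())
    return {label: round(count / total, 4) for label, count in counts.items()}

rel_1990 = relative_frequencies(counts_1990)
rel_2003 = relative_frequencies(counts_2003)

for label in counts_1990:
    print(f"{label:<32} {rel_1990[label]:.4f}  {rel_2003[label]:.4f}")
```

Comparing the two printed columns row by row is the numeric version of what the side-by-side bar graph shows visually.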

Note: This chart can be created quite easily by using EXCEL. I chose to do it by hand here, but with the use of the Charts Wizard you can recreate it in EXCEL. Just follow the step-by-step instructions in the dialogue boxes.

[Figure: side-by-side bar graph titled “Educational Attainment 1990 vs 2003”. Vertical axis marked from 0.05 to 0.35 in steps of 0.05.]

f) Could a pie chart be created for the 1990 data? What would a pie chart show? Would creating a pie chart for 1990 and 2003 and putting them side-by-side be as effective as the side-by-side bar graph?
g) Could a Pareto chart be created for the 1990 data? What would the Pareto chart show? Would this be useful? Would a comparison of 1990 and 2003 Pareto charts be interesting?

Example: Let’s look at Exercise #10, p. 57 from Brase’s Understandable Statistics.

Based on a survey by Valvoline Oil Company of 500 drivers, as published in the USA Today, the top complaints of other drivers were as follows (it is not indicated whether each marked only one or if multiple were allowed):

Tailgating        22%
No Turn Signals   19%
Cutting Off       16%
Driving Slow      11%
Inconsiderate      8%

a) Make a Pareto chart for the data.
b) Would a circle graph make sense for this data? Why or why not?
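A Pareto chart only needs the categories sorted by descending frequency, and the circle-graph question comes down to whether the percentages account for a whole. The Python sketch below does both checks; the text bars are a stand-in for a drawn chart, and the names are illustrative.

```python
# Percentages of drivers naming each complaint, from the survey above
complaints = {
    "Tailgating": 22,
    "No Turn Signals": 19,
    "Cutting Off": 16,
    "Driving Slow": 11,
    "Inconsiderate": 8,
}

# Pareto order: tallest bar first
pareto = sorted(complaints.items(), key=lambda item: item[1], reverse=True)

for name, percent in pareto:
    print(f"{name:<16}{'#' * percent} {percent}%")

# A circle graph represents a whole, so the percentages would need to total 100
total_percent = sum(complaints.values())
print(f"Total: {total_percent}%")
```

Since the listed complaints do not total 100%, a circle graph of just these categories would misrepresent the whole; whether an “other” slice would fix that depends on whether each driver marked only one complaint, which the source does not say.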

Example: Let’s look at Exercise #12, p. 57 from Brase’s Understandable Statistics.

According to the Physician’s Handbook, the average heights at different ages for boys, aged 0.5 through 14, are as follows:

Age   Height    Age   Height
0.5   26        8     50
1     29        9     52
2     33        10    54
3     36        11    56
4     39        12    58
5     42        13    60
6     45        14    62
7     47

a) Finish the time series graph started below for this data.
b) What useful information can we see from this graph? Why might this be better than a bar chart of the data?


Note: This graph can be created in EXCEL using the chart wizard. This is also a graph that can be created on the TI-83 or 84 using the line plot in the stat plot menu.

[Figure: partially completed time-series graph. Horizontal axis: Age (Years), marked at 1, 5, 10 and 14. Vertical axis marked from 24 to 63 in steps of 3.]

§2.3 Stem-and-Leaf Displays

We then move to another visual called a stem and leaf plot.

A stem-and-leaf plot gives us a nice visualization of the data shape, but unlike the histogram, allows us to keep the original data. This is what makes it superior to a histogram – there is no loss of the original data. Another nice thing about a stem-and-leaf plot is that outliers are easy to spot.

This type of plot works with a stem, which is based upon the tens, hundreds, etc. and the leaf, which completes the number. You can think of this process like writing a number in expanded form! For large data sets it is sometimes necessary to give each stem two representations to spread the data out a little more (this does add some level of error to the process and thus misrepresentation of the data is possible), called an expanded stem-and-leaf (your book calls it a split stem & leaf). We can also bring the data together a little more for smaller data sets creating a condensed stem-and-leaf (the same problem with misrepresentation can occur).

Creating a Stem-and-Leaf Plot
1. Look over the data and decide upon stem and leaf (all below 100, use a stem of 10’s)
2. Write a table with stems on left and leaves on right (label the stem & leaf unit value)
3. Fill in the leaves to represent all the data points
*Note: Your book writes an example in the first row, with the example’s expansion as a way of showing the stem and leaf units. This is not typical and you will see me write it in the more typical manner in class, which is stems and leaves w/ multiplication by a factor of ten.

Example: Create a stem and leaf plot for the purse-snatchers data:
16, 41, 25, 21, 30, 17, 29, 50, 30 & 39
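The three steps can be sketched in Python for two-digit data, with the tens digit as the stem and the ones digit as the leaf. This is an illustration only, and the layout of the printed rows is an assumption rather than the format used in class or in the book.

```python
from collections import defaultdict

ages = [16, 41, 25, 21, 30, 17, 29, 50, 30, 39]

# Step 1: stem = tens digit, leaf = ones digit (all data are below 100)
stems = defaultdict(list)
for age in sorted(ages):          # sorting first keeps the leaves in order
    stems[age // 10].append(age % 10)

# Steps 2 and 3: stems on the left, leaves on the right
print("Stem (x10) | Leaf (x1)")
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:>10} | {leaves}")
```

Every original value can be read back from a row (stem 2 with leaf 5 is 25), which is the advantage over a histogram.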

Now, if you look at this on its side, you will see nearly the same shape as the histogram, but with the added benefit of still being able to see the original data. Another advantage of the stem-and-leaf is that it sorts the data, which prepares us for finding some other important descriptive statistics that require ordered or ranked data (such as medians, quartiles & percentiles).


Let’s look at a data set that would be much better if viewed with an expanded stem-and-leaf plot.

Example: The following are 50 recorded decibel ratings from an unknown event (I can’t remember where they came from anymore!)

52, 54, 54, 55, 55, 55, 55, 56, 56, 56, 56, 56, 57, 57, 58, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 61, 61, 61, 61, 62, 62, 62, 62, 63, 63, 63, 64, 64, 64, 64, 65, 66, 66, 67, 67, 67, 68, 68, 69, 77

*Note: This stem-and-leaf was created using Minitab. Unfortunately, we have no technology that is capable of creating a stem-and-leaf plot. We must create these handy visuals by hand!
*Note2: This is an expanded stem-and-leaf plot where there are 5 stems for ten possible leaves. This is an extreme situation; most of the time you would simply need 2 stems for the 10 possible leaves.
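For readers who do have a programming language at hand, an expanded stem-and-leaf with 5 rows per stem can be sketched in Python. The values are transcribed from the list above (the transcription itself is an assumption wherever the list is hard to read), and `LINES_PER_STEM` is a made-up name; setting it to 2 gives the more common two-rows-per-stem version.

```python
from collections import defaultdict

decibels = [
    52, 54, 54, 55, 55, 55, 55, 56, 56, 56, 56, 56, 57, 57, 58, 59, 59,
    59, 59, 59, 60, 60, 60, 60, 60, 60, 61, 61, 61, 61, 62, 62, 62, 62,
    63, 63, 63, 64, 64, 64, 64, 65, 66, 66, 67, 67, 67, 68, 68, 69, 77,
]

LINES_PER_STEM = 5                       # 5 rows per stem: leaves 0-1, 2-3, 4-5, 6-7, 8-9
leaves_per_line = 10 // LINES_PER_STEM

# Group each leaf under (stem, which row of that stem it belongs to)
rows = defaultdict(list)
for value in sorted(decibels):
    stem, leaf = divmod(value, 10)
    rows[(stem, leaf // leaves_per_line)].append(leaf)

# Print every row from the first used row to the last, including empty ones
first, last = min(rows), max(rows)
display = []
for stem in range(first[0], last[0] + 1):
    for part in range(LINES_PER_STEM):
        if first <= (stem, part) <= last:
            leaves = "".join(str(leaf) for leaf in rows.get((stem, part), []))
            display.append(f"{stem} | {leaves}".rstrip())

print("\n".join(display))
```

Keeping the empty rows is a deliberate choice: the blank rows between the 60s and the lone 77 make that value stand out as a possible outlier.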

This section and the preceding are very helpful in showing how to visualize data (CVDOT). We can easily see the center of the data, the variation, and the distribution via a histogram, and with the dot plot or stem and leaf we can see outliers very easily. The methods for visualization of data that we have discussed so far give us a way of conducting Exploratory Data Analysis (EDA). EDA consists of ways of looking at data to explore it, to ask questions that were not originally considered, and to take the analysis in new directions. It is not unusual to begin an analysis of data and to end up seeing trends or patterns that lead to more questions and deeper analysis (answering further questions raised using statistical analysis, however, would technically require another investigation!). In every statistical analysis, EDA is very important and sometimes just the beginning of bigger studies!
