asalaslemand.weebly.comasalaslemand.weebly.com/uploads/3/1/3/1/31310805/lecture... · web...

24
1 Lectures 4 Updated: Wed, Sep 17 th . Announcement Tim Horton’s nutritional information example to illustrate concepts for chapter 4: oDisplaying and summarizing quantitative data

Post on 05-Apr-2020

Page 1: asalaslemand.weebly.comasalaslemand.weebly.com/uploads/3/1/3/1/31310805/lecture... · Web viewLectures 4. Updated: Wed, Sep 17. th. Announcement. Tim Horton’s nutritional information

Lectures 4
Updated: Wed, Sep 17th.

Announcement

Tim Horton’s nutritional information example to illustrate concepts for chapter 4:
o Displaying and summarizing quantitative data

AccessAbility Services

Thank you for your interest in becoming a Volunteer Note Taker!

Benefits:

  MAKE A DIFFERENCE!

  Contribute to your peer community

  Build your resume & receive a certificate from UTSC.

 

Upload or scan your sample notes conveniently using myAIMS for Note Takers:

Go to the www.utsc.utoronto.ca/ability homepage & find the ‘myAIMS’ link in the top right corner.

1. Follow the simple step-by-step process to register.
2. You may upload/scan files from home or use the scanner at AccessAbility Services SW302.

If you have any questions or need assistance please contact AccessAbility Services

Tel/TTY (416) 287-7560 or

[email protected]

Below is a snapshot of the nutritional information for all donuts at Tim Horton’s. These are real data, and they can be reproduced from:

http://www.timhortons.com/ca/en/menu/nutrition-calculator.php#?

The data file, donuts.csv, is posted on Asal’s web page, under lecture notes.

Donuts Data

ID  Donut                                Type of Donut  Calories  Fat  Protein  Carbs  Fiber  Sugar
 1  Sugar Loop Donut                     Yeast               180    6        4     28      1      8
 2  Maple Dip Donut                      Yeast               190    6        4     31      1     11
 3  Honey Dip Donut                      Yeast               190    6        4     31      1     11
 4  Chocolate Dip Donut                  Yeast               190    6        4     31      1     10
 5  Maple Glazed Donut                   Yeast               210    8        4     32      1     13
 6  Vanilla Dip with Coloured Sprinkles  Yeast               250    6        4     46      1     24
 7  Apple Fritter Donut                  Yeast               290    8        7     48      2     15
 8  Caramel Apple Fritter Donut          Yeast               300    8        7     52      2     17
 9  Old Fashion Plain Donut              Cake                210   10        3     25      1      8
10  Cinnamon Sugar Donut                 Cake                220   10        3     28      1     10
11  Old Fashion Dip Donut                Cake                250   10        3     36      1     17
12  Sour Cream Cinnamon Donut            Cake                270   16        3     29      1     12
13  Old Fashion Glazed Donut             Cake                270   10        3     41      1     23
14  Double Chocolate Donut               Cake                270   14        4     35      1     16
15  Birthday Cake Donut                  Cake                280   11        3     42      1     24
16  Chocolate Glazed Donut               Cake                280   14        4     37      1     19
17  Peanut Crunch Donut                  Cake                300   14        5     39      1     20
18  Sour Cream Glazed Donut              Cake                340   16        3     46      1     29
19  Pumpkin Spice Donut                  Cake                250    9        4     41      1     23
20  Strawberry Donut                     Filled              200    5        5     34      1     14
21  Blueberry Donut                      Filled              200    5        4     34      1     12
22  Canadian Maple Donut                 Filled              210    6        5     37      1     16
23  Boston Cream Donut                   Filled              220    6        5     37      1     15
24  Strawberry Bloom Donut               Filled              230    7        4     39      1     18
25  Banana Split Donut                   Filled              230    5        4     40      1     18
26  Strawberry Shortcake Donut           Filled              250    8        5     40      1     15
27  Stanley Cup Donut                    Filled              270    6        5     48      1     24
28  Strawberry Vanilla Donut             Filled              270    5        5     52      1     31
29  Oreo Donut                           Filled              400   15        5     61      1     35
30  Honey Cruller Donut                  Other               310   18        2     37      0     22

Graphical display for a Quantitative Variable:

- Histogram
- Stem-and-leaf plot
- Dotplot
- Boxplot

Note that each plot shows the shape, centre, and spread of the distribution. It also shows any unusual observations (points plotted far away from the rest of the data). For each plot, we will refer to specific statistics to describe the centre and spread of the distribution. See each plot for the ways the distribution is described.

Histogram:
- Displays the entire distribution.
- Slices up all the possible values into equal-width bins, also called classes.
- Counts the number of cases that fall into each bin (class).

Donuts Example:

o StatsCrunch: Graph>Histogram>Select Column(s)>Calories

The bins run from 150 to 450. The first bin is from 150 to 200. This means that the donuts whose calories are 150 or more, but less than 200 (not including 200), are counted in the first bin. So, 150 ≤ calories < 200.

The second bin is from 200 to 250, that is, 200 ≤ calories < 250.
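The binning rule can be checked by hand or with a few lines of code. The sketch below is not StatsCrunch output; it assumes a bin width of 50 (taken from the first two bins described above), and the function name `bin_counts` is made up for illustration.

```python
# Sorted calories for the 30 donuts in the lecture's data set.
calories = [180, 190, 190, 190, 200, 200, 210, 210, 210, 220,
            220, 230, 230, 250, 250, 250, 250, 270, 270, 270,
            270, 270, 280, 280, 290, 300, 300, 310, 340, 400]


def bin_counts(values, start, stop, width):
    """Count the values in equal-width bins [lower, lower + width)."""
    counts = [0] * ((stop - start) // width)
    for v in values:
        if start <= v < stop:
            # Integer division maps a value to its bin; a value equal to
            # an upper edge (e.g. 200) lands in the next bin.
            counts[(v - start) // width] += 1
    return counts


counts = bin_counts(calories, start=150, stop=450, width=50)
for i, c in enumerate(counts):
    lower = 150 + 50 * i
    print(f"{lower} <= calories < {lower + 50}: {'*' * c} ({c})")
```

The fifth bin (350 ≤ calories < 400) comes out empty, which is the gap that separates the 400-calorie donut from the rest of the data.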

Stem-and-leaf display:
- Created by John W. Tukey.
- Shows individual values of the data.

Example: Sorted values of calories from the donuts data.

Calories 180 190 190 190 200 200 210 210 210 220 220 230 230 250 250

250 250 270 270 270 270 270 280 280 290 300 300 310 340 400

o StatsCrunch: Graph>stem and leaf>Select Column(s)>Calories

Variable: Calories

Leaf unit = 10

1 : 8999
2 : 001112233
2 : 555577777889
3 : 0014
3 :
4 : 0

Steps to construct stem and leaf plots:

- Find the largest and the smallest number. In our example: 180 (smallest) and 400 (largest).

- Split all numbers into two parts: the stem and the leaf. The stem is the left part of the number and the leaf is the right part. For example, let's look at a value in our data: 270. The number 2 is the stem and 7 is the leaf. We drop the 0 from 270 (the leaf unit takes care of it). So, it looks like we have 27, but we multiply it by the leaf unit 10, and we get back 270. Great!

- The stem consists of a column of numbers in sequence, starting with the smallest (or the largest, as displayed in textbook examples).

- The number of stems depends on the size of the data. Small data sets should have few stems.

- To figure out the number of stems, subtract the stem part of the smallest number from the stem part of the largest number, then add 1. In our example, the stem part of 180 (smallest number) is 1, and the stem part of 400 (largest number) is 4. So, we have 4 - 1 = 3. But we are interested in knowing how many stems there are from 1 to 4! There are 4 numbers from 1 to 4: 1, 2, 3, 4. So, every time you subtract the stem part of the smallest number from the stem part of the largest number, you add 1 to the result. In our example: (4 - 1) + 1 = 4 stems, from 1 to 4 :)

- Write the leaf part for every number on the same line as its stem.

- However, 4 stems is really short. We need a minimum of 5 stems. Solution: stretch the stems. Construct two bins per stem: the leaves from 0 to 4 go on one line of the stem, and the leaves from 5 to 9 go on another line of the same stem. In other words, you write each stem number twice (see the display below).

- If no numbers fall into the bin for the first or the last row of the stem and leaf display, delete that row. If no numbers fall into the bin for any other row (not the first or the last), leave that stem line in place, but empty. See our example again:

Variable: Calories

Leaf unit = 10

1 : 8999
2 : 001112233
2 : 555577777889
3 : 0014
3 :
4 : 0

For the leaf:
o Use only 1 digit and drop the rest of the number.
o Do not round.
o Do not use commas.
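The construction steps can be sketched in code. This is only an illustration of the rules described above (leaf unit 10, stretched stems with leaves 0 to 4 and 5 to 9 on separate lines, truncation instead of rounding), not how StatsCrunch builds its display; the function name `stem_and_leaf` is hypothetical.

```python
# Sorted calories for the 30 donuts in the lecture's data set.
calories = [180, 190, 190, 190, 200, 200, 210, 210, 210, 220,
            220, 230, 230, 250, 250, 250, 250, 270, 270, 270,
            270, 270, 280, 280, 290, 300, 300, 310, 340, 400]


def stem_and_leaf(values, leaf_unit=10):
    """Return the lines of a stretched-stem display (two lines per stem)."""
    rows = {}
    for v in sorted(values):
        # Drop digits below the leaf unit (truncate, do not round),
        # then split what is left into stem and one-digit leaf.
        stem, leaf = divmod(v // leaf_unit, 10)
        half = 0 if leaf <= 4 else 1  # 0: leaves 0-4, 1: leaves 5-9
        rows.setdefault((stem, half), []).append(str(leaf))

    # Empty first/last rows never appear because we start and stop at
    # occupied rows; empty rows in between are kept.
    first, last = min(rows), max(rows)
    lines = []
    stem, half = first
    while (stem, half) <= last:
        lines.append(f"{stem} : {''.join(rows.get((stem, half), []))}")
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    return lines


print("Leaf unit = 10")
for line in stem_and_leaf(calories):
    print(line)
```

The empty second line for stem 3 is kept because it is not the first or last row, matching the rule about empty stem lines.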

Dotplot: Places a dot along an axis for each case in the data.

Donuts Example:
o StatsCrunch: Graph>Histogram>Select Column(s)>Calories
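To see what "a dot for each case" means, here is a minimal text sketch of a dotplot for the calories data. It draws the axis vertically (one row per distinct value) purely for convenience in plain text; the helper name `dotplot_rows` is an assumption for this example.

```python
from collections import Counter

# Sorted calories for the 30 donuts in the lecture's data set.
calories = [180, 190, 190, 190, 200, 200, 210, 210, 210, 220,
            220, 230, 230, 250, 250, 250, 250, 270, 270, 270,
            270, 270, 280, 280, 290, 300, 300, 310, 340, 400]


def dotplot_rows(values):
    """One row per distinct value, with one dot per case at that value."""
    counts = Counter(values)
    return [f"{value:>4} | {'.' * counts[value]}" for value in sorted(counts)]


for row in dotplot_rows(calories):
    print(row)
```

Because the rows are not spaced to scale, gaps in the data (such as the jump from 340 to 400) are not visible here the way they are on a real dotplot axis.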

Describe a distribution:

Think about:

- Shape

- Centre

- Spread

The shape of a distribution (e.g. the histogram):

1. Notice the peaks (humps), also called modes:

a. If there is a single mode: unimodal – example:

b. If there are two peaks: bimodal – example:

c. If there are three or more peaks: multimodal – example:

d. If there do not appear to be any peaks, and all bars are approximately the same height: uniform – example:

2. Symmetric or Skewed:

a. Draw a vertical line in the middle and fold the graph along this line so that the edges match closely. Are the values equally distributed on each side of the fold line? If so, the distribution is symmetric.

b. Notice the tails (the end of a distribution). If one tail stretches out further than the other, the histogram is said to be skewed to the side of the longer tail.

3. Unusual observation:

a. Plotted individually, far away from the rest of the data.

b. Potential outlier(s). We can check if they are outliers.

c. If there is only a small gap from the rest of the data, then the observation is not that unusual and is probably not an outlier (see the histogram for calories: there is a gap in that distribution).

d. Can affect the statistical methods employed.

e. Point them out and explain them.

f. Covered in chapter 5: Boxplots.

The centre of a distribution:

We will use the calories example to calculate the following.

Calories 180 190 190 190 200 200 210 210 210 220 220 230 230 250 250

250 250 270 270 270 270 270 280 280 290 300 300 310 340 400

Average or mean: We sum all of the observations of the variable whose mean we want, and divide by the total number of cases for that variable:

$$\bar{y} = \frac{\sum y}{n} = \frac{\text{sum of all values for a variable}}{\text{number of cases for that variable}}$$

In the calories example:

$$\bar{y} = \frac{180 + 190 + \dots + 400}{30} = \frac{7530}{30} = 251$$

- Note that the mean is influenced by extremely large or small (unusual) observations. The mean is not resistant to extreme values; that is, it is sensitive to them.

Median: The middle value in the sorted data. The 50th percentile.

o For an odd number of data values: the median is the value at position (n+1)/2 (the middle number).

o For an even number of data values: the median is the average of the values at positions n/2 and n/2 + 1 (the average of the two middle numbers). In our example, we have 30 donuts, so n is even. The median is the average of the 15th and the 16th ordered values: (250+250)/2 = 250

o The median is resistant (not sensitive) to values that are extremely large or small, because it depends on the order of the data values and not on what the extreme values actually are.

o Note: 50th percentile means that about 50% of the data values are below the median and about 50% of the data values are above the median.
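The two definitions, and the difference in resistance, can be checked with a short sketch. The functions `mean` and `median` below are written out by hand to mirror the position rules above. The "what if" data set, in which the 400-calorie donut is replaced by a made-up 4000-calorie one, is purely hypothetical and not part of the donuts data.

```python
# Sorted calories for the 30 donuts in the lecture's data set.
calories = [180, 190, 190, 190, 200, 200, 210, 210, 210, 220,
            220, 230, 230, 250, 250, 250, 250, 270, 270, 270,
            270, 270, 280, 280, 290, 300, 300, 310, 340, 400]


def mean(values):
    """Sum of all values divided by the number of cases."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of two middles if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1


print(mean(calories), median(calories))           # 251.0 250.0

# Hypothetical extreme value: swap the largest observation for 4000.
what_if = calories[:-1] + [4000]
print(mean(what_if), median(what_if))             # 371.0 250.0
```

The single extreme value pulls the mean from 251 up to 371, while the median stays at 250.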

Comparing mean and median:

o In an approximately symmetric distribution, the mean and the median will be close to each other.

o In a skewed distribution: If Mean < Median, the data are left skewed. If Mean > Median, the data are right skewed.

o We report the mean when the distribution is unimodal and symmetric. E.g. students’ marks.

o We report the median for a skewed distribution. E.g. sport salaries.

o StatsCrunch: stat>summary stats>Select Column(s)>Calories

In the Statistics section, I selected the following statistics:

Summary statistics:

Column    n   Sum   Mean  Variance   Std. dev.  Min  Max  Range  Q1   Median  Q3   IQR
Calories  30  7530  251   2505.8621  50.058586  180  400  220    210  250     280  70

So, the mean calories is 251 and the median is 250. Since the two are close, we can say that the data are approximately symmetric. Alternatively, since the mean is slightly larger than the median, one can say that the data are slightly right skewed.

The spread of a distribution:

Calories 180 190 190 190 200 200 210 210 210 220 220 230 230 250 250

250 250 270 270 270 270 270 280 280 290 300 300 310 340 400

Recall the mean: 251 calories

Range: Maximum value – Minimum value
In our calories example: 400 - 180 = 220

Interquartile range (IQR): 75th percentile (Q3) – 25th percentile (Q1)

o We can divide the data into quartiles.

o The median is the 50th percentile (the second quartile, Q2). It has half the data below it and half the data above it.

o The first quartile is Q1, the 25th percentile. It has 1/4 (25%) of the observations below it and 3/4 (75%) of the observations above it. To find Q1, we search for the middle number(s) in the first half of the data set (below the median).

In our example, Q1 is the value in the 8th position of the first half of the data set: 210 calories.

o The third quartile is Q3, the 75th percentile. It has 3/4 (75%) of the observations below it and 1/4 (25%) of the observations above it. To find Q3, we search for the middle number(s) in the second half of the data set (above the median).

In our example, Q3 is the value in the 8th position of the second half of the data set: 280 calories.

o So, we can calculate the IQR = 280 – 210 = 70

o The 5-number summary (needed for the boxplot, later): Minimum, Maximum, Median, Q1, Q3

In our example: min=180, max=400, median=250, Q1=210, Q3=280
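The "median of each half" rule for quartiles can be sketched as follows. Software packages use several slightly different quartile conventions, so this is only one of them: it follows the rule described above and, as an extra assumption that the lecture does not address, leaves the middle value out of both halves when n is odd. The name `five_number_summary` is made up for this example.

```python
# Sorted calories for the 30 donuts in the lecture's data set.
calories = [180, 190, 190, 190, 200, 200, 210, 210, 210, 220,
            220, 230, 230, 250, 250, 250, 250, 270, 270, 270,
            270, 270, 280, 280, 290, 300, 300, 310, 340, 400]


def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # first half (below the median)
    upper = ordered[len(ordered) - half:]   # second half (above the median)
    return {
        "min": ordered[0],
        "Q1": median(lower),
        "median": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }


summary = five_number_summary(calories)
data_range = summary["max"] - summary["min"]
iqr = summary["Q3"] - summary["Q1"]
print(summary, "range =", data_range, "IQR =", iqr)
```

With 15 values in each half, the middle of each half is its 8th value, which is where 210 and 280 come from.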

Standard Deviation:

o How far is each value from the mean (its deviation)?
- Values to the left of the mean have negative deviations.
- Values to the right of the mean have positive deviations.
- When we add up the deviations from left and right, the positive and negative values cancel each other out, so the sum is always zero. Not so helpful.

o The solution is to square the deviations. Recall: Pythagorean theorem.

o Squaring always gives positive or zero values. If all the data values are the same, every squared deviation is zero and we have no variation.

o We add up the squared deviations and find their average (almost! we divide by n - 1 instead of n). This is the variance:

$$S^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

o Variance is measured in squared units.

o We take the positive square root of the variance to find the spread in the data. This is the standard deviation, denoted by S, and it has the same units of measurement as the data:

$$S = +\sqrt{S^2}$$

o Standard deviation is not resistant to outliers; that is, it is sensitive to them. Outliers can affect the standard deviation, S.

In our example:

$$S^2 = \frac{(180 - 251)^2 + (190 - 251)^2 + \dots + (400 - 251)^2}{30 - 1} = 2505.8621$$

$$S = +\sqrt{2505.8621} = 50.06$$
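The arithmetic above can be reproduced step by step. The sketch below follows the same recipe (deviations, squared deviations, divide by n - 1, square root); the function names are chosen for this example only.

```python
import math

# Sorted calories for the 30 donuts in the lecture's data set.
calories = [180, 190, 190, 190, 200, 200, 210, 210, 210, 220,
            220, 230, 230, 250, 250, 250, 250, 270, 270, 270,
            270, 270, 280, 280, 290, 300, 300, 310, 340, 400]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)


def sample_std_dev(values):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))


y_bar = sum(calories) / len(calories)
deviations = [y - y_bar for y in calories]

print(sum(deviations))                        # the raw deviations cancel out
print(round(sample_variance(calories), 4))    # variance, in squared calories
print(round(sample_std_dev(calories), 2))     # standard deviation, in calories
```

The sum of squared deviations here is 72670, and 72670 / 29 gives the variance reported in the summary statistics.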

Summary:

A histogram shows us the shape of a distribution. We can see whether there are any peaks. We can also see whether any bins are displayed away from the rest of the data (showing the gaps). As a measure of spread, we use the standard deviation, and for describing the centre of a distribution, we refer to the mean.

We use the 5-number summary: min, max, median, Q1, Q3, for displaying a boxplot. For the measure of spread, we use IQR, and for describing the centre of a distribution we use the median. The extreme values are plotted individually with a star.

In general, when we talk about a distribution, we talk about its shape, centre, spread, and note any unusual observations plotted individually.