2.4 Describing Distributions Numerically


Upload: joylyn

Post on 05-Jan-2016



Page 1: 2.4 Describing Distributions Numerically

2.4 Numerical Summaries of Data

Numerical and More Graphical Methods to Describe Univariate Data

Page 2: 2.4 Describing Distributions Numerically

2 characteristics of a data set to measure

center

measures where the “middle” of the data is located

variability

measures how “spread out” the data is

Page 3: 2.4 Describing Distributions Numerically

The median: a measure of center

Given a set of n measurements arranged in order of magnitude,

Median = middle value, if n is odd

Median = mean of the 2 middle values, if n is even

Ex. 2, 4, 6, 8, 10; n=5; median=6

Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5
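The definition above can be sketched in a few lines of Python. This is only an illustration of the odd/even rule, not code from the slides; the function name `median` is chosen here for convenience.

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value
        return ordered[mid]
    # n even: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```

Sorting first means the input does not have to be ordered already.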

Page 4: 2.4 Describing Distributions Numerically

Student Pulse Rates (n=62)

38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103

Median = (75+76)/2 = 75.5

Page 5: 2.4 Describing Distributions Numerically

Medians are used often

Year 2017 baseball salaries: median $1,562,500 (max=$33,000,000, Clayton Kershaw; min=$535,000)

Median age of TV sports viewers: PGA 64, NASCAR 58, MLB 57, WTA 55, NFL 50, NHL 49, NBA 42, MLS 40

Median existing home sales price: June 2017 $263,800; June 2016 $243,200

US Median household income (2015 dollars) 2015 $56,516; 2014 $53,029

NC Median household income (2015 dollars) 2015 $50,797; 2014 $46,838

Page 6: 2.4 Describing Distributions Numerically

Median Salaries by Major

Page 7: 2.4 Describing Distributions Numerically

The median splits the histogram into 2 halves of equal area

Page 8: 2.4 Describing Distributions Numerically

The median splits the histogram into 2 halves of equal area

Median $25,966

NC $24,358

Page 9: 2.4 Describing Distributions Numerically

Examples

Example: n = 7

17.5 2.8 3.2 13.9 14.1 25.3 45.8

Example n = 7 (ordered):

2.8 3.2 13.9 14.1 17.5 25.3 45.8

m = 14.1

Example: n = 8

17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8

Example n = 8 (ordered):

2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

m = (14.1+17.5)/2 = 15.8

Page 10: 2.4 Describing Distributions Numerically


Think about the median

Six people in a room have a median age of 45 years.

One person who is 40 years old leaves the room.

Question:

What is the median age of the 5 people remaining in the room?

Page 11: 2.4 Describing Distributions Numerically

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245

2. 4965.5

3. 4960

4. 4971

Page 12: 2.4 Describing Distributions Numerically

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245

2. 4965.5

3. 5546

4. 4971

Page 13: 2.4 Describing Distributions Numerically

Measures of Spread

The range and interquartile range

Page 14: 2.4 Describing Distributions Numerically

Ways to measure variability

range = largest - smallest

OK sometimes; in general, too crude; sensitive to one large or small data value

The range measures spread by examining the ends of the data

A better way to measure spread is to examine the middle portion of the data
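To see how much a single value can move the range, the short sketch below uses the 62 student pulse rates listed earlier in this deck. The helper name `data_range` is just an illustrative choice.

```python
# The 62 student pulse rates from the earlier slide.
pulses = [
    38, 59, 60, 60, 62, 62, 63, 63, 64, 64,
    65, 67, 68, 70, 70, 70, 70, 70, 70, 70,
    71, 71, 72, 72, 73, 74, 74, 75, 75, 75,
    75, 76, 77, 77, 77, 77, 78, 78, 79, 79,
    80, 80, 80, 84, 84, 85, 85, 87, 90, 90,
    91, 92, 93, 94, 94, 95, 96, 96, 96, 98,
    98, 103,
]


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# Same data with the single smallest reading (38) left out.
without_low = [p for p in pulses if p != 38]

print(data_range(pulses))       # 65
print(data_range(without_low))  # 44
```

Dropping one reading out of 62 shrinks the range from 65 to 44, which is the sensitivity the slide warns about.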

Page 15: 2.4 Describing Distributions Numerically

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

m = median = 3.4

Q1= first quartile = 2.3

Q3= third quartile = 4.2

1    1    0.6
2    2    1.2
3    3    1.6
4    4    1.9
5    5    1.5
6    6    2.1
7    7    2.3
8    6    2.3
9    5    2.5
10   4    2.8
11   3    2.9
12   2    3.3
13   1    3.4
14   2    3.6
15   3    3.7
16   4    3.8
17   5    3.9
18   6    4.1
19   7    4.2
20   6    4.5
21   5    4.7
22   4    4.9
23   3    5.3
24   2    5.6
25   1    6.1

Page 16: 2.4 Describing Distributions Numerically

Quartiles and median divide data into 4 pieces

Q1 M Q3

1/4 1/4 1/4 1/4

Page 17: 2.4 Describing Distributions Numerically

Quartiles are Common Measures of Spread

Page 18: 2.4 Describing Distributions Numerically

Mid-career earnings by major: 25th, 50th, 75th percentiles.

Page 20: 2.4 Describing Distributions Numerically

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half)

Step 2a: find the median of the lower half; this median is Q1;

Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd include the overall median in both halves; when n is even do not include the overall median in either half.
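The rules above can be sketched in Python as follows. The function names are illustrative, and the convention is the one stated on this slide (include the overall median in both halves when n is odd). Statistical software often uses other quartile conventions, so library results may differ slightly.

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted and non-empty."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the slide's rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("quartiles of empty data are undefined")
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower = ordered[:half + 1]
        upper = ordered[half:]
    else:
        # n even: the halves do not share any value
        lower = ordered[:half]
        upper = ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```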

Page 21: 2.4 Describing Distributions Numerically

Example 2 4 6 8 10 12 14 16 18 20 n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1 : median of lower half 2 4 6 8 10

Q1 = 6

Q3 : median of upper half 12 14 16 18 20

Q3 = 16


Page 22: 2.4 Describing Distributions Numerically

Pulse Rates n = 138

#    Stem  Leaves
     4*
3    4.    588
9    5*    001233444
10   5.    5556788899
23   6*    00011111122233333344444
23   6.    55556666667777788888888
16   7*    00000112222334444
23   7.    55555666666777888888999
10   8*    0000112224
10   8.    5555667789
4    9*    0012
2    9.    58
4    10*   0223
     10.
1    11*   1

Median: mean of pulses in locations 69 & 70: median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Page 23: 2.4 Describing Distributions Numerically

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

#    stem  leaf
2    22    55
4    23    57
6    24    26
7    25    7
10   26    257
12   27    59
(4)  28    1567
15   29    35599
10   30    333
7    31    45
5    32    155
2    33    6
1    34    0

1. 287

2. 257.5

3. 263.5

4. 262.5

Page 24: 2.4 Describing Distributions Numerically

Interquartile range

lower quartile Q1

middle quartile: median

upper quartile Q3

interquartile range (IQR)

IQR = Q3 – Q1

measures spread of middle 50% of the data

Page 25: 2.4 Describing Distributions Numerically

Example: beginning pulse rates

Q3 = 78; Q1 = 63

IQR = 78 – 63 = 15

Page 26: 2.4 Describing Distributions Numerically

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

#    stem  leaf
2    22    55
4    23    57
6    24    26
7    25    7
10   26    257
12   27    59
(4)  28    1567
15   29    35599
10   30    333
7    31    45
5    32    155
2    33    6
1    34    0

1. 23.5

2. 39.5

3. 46

4. 69.5

Page 27: 2.4 Describing Distributions Numerically

5-number summary of data

Minimum Q1 median Q3 maximum

Pulse data

45 63 70 78 111

Page 28: 2.4 Describing Distributions Numerically

m = median = 3.4

Q3= third quartile = 4.2

Q1= first quartile = 2.3

25   1    6.1
24   2    5.6
23   3    5.3
22   4    4.9
21   5    4.7
20   6    4.5
19   7    4.2
18   6    4.1
17   5    3.9
16   4    3.8
15   3    3.7
14   2    3.6
13   1    3.4
12   2    3.3
11   3    2.9
10   4    2.8
9    5    2.5
8    6    2.3
7    7    2.3
6    6    2.1
5    5    1.5
4    4    1.9
3    3    1.6
2    2    1.2
1    1    0.6

Largest = max = 6.1

Smallest = min = 0.6

[Figure: boxplot of the Disease X data; axis: Years until death, ticks 0 to 7]

Five-number summary:

min Q1 m Q3 max
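As a rough check, the five-number summary of the 25 Disease X values can be computed with a short Python sketch. The function name `five_number_summary` is an illustrative choice, and the quartile convention is the one given earlier in this deck (the overall median is included in both halves when n is odd); other software may use a different convention.

```python
# Years until death for the 25 Disease X individuals, as tabulated in the deck.
disease_x = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8,
    2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5,
    4.7, 4.9, 5.3, 5.6, 6.1,
]


def _median_sorted(ordered):
    """Median of a list that is already sorted and non-empty."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + n % 2]  # includes the median when n is odd
    upper = ordered[half:]          # includes the median when n is odd
    return (
        ordered[0],
        _median_sorted(lower),
        _median_sorted(ordered),
        _median_sorted(upper),
        ordered[-1],
    )


print(five_number_summary(disease_x))  # (0.6, 2.3, 3.4, 4.2, 6.1)
```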

Boxplot: display of 5-number summary

BOXPLOT

Page 29: 2.4 Describing Distributions Numerically

Boxplot: display of 5-number summary

Example: age of 66 “crush” victims at rock concerts in a recent year.

5-number summary:13 17 19 22 47

Page 30: 2.4 Describing Distributions Numerically

Rock concert deaths: histogram and boxplot

Page 31: 2.4 Describing Distributions Numerically

Boxplot construction

1) construct box with ends located at Q1 and Q3; in the box mark the location of median (usually with a line or a “+”)

2) fences are determined by moving a distance 1.5(IQR) from each end of the box;

2a) upper fence is 1.5*IQR above the upper quartile

2b) lower fence is 1.5*IQR below the lower quartile

Note: the fences only help with constructing the boxplot; they do not appear in the final boxplot display

Page 32: 2.4 Describing Distributions Numerically

Box plot construction (cont.)

3) whiskers: draw lines from the ends of the box left and right to the most extreme data values found within the fences;

4) outliers: special symbols represent each data value beyond the fences;

4a) sometimes a different symbol is used for “far outliers” that are more than 3 IQRs from the quartiles

Page 33: 2.4 Describing Distributions Numerically

Q3= third quartile = 4.2

Q1= first quartile = 2.3

25   1    7.9
24   2    6.1
23   3    5.3
22   4    4.9
21   5    4.7
20   6    4.5
19   7    4.2
18   6    4.1
17   5    3.9
16   4    3.8
15   3    3.7
14   2    3.6
13   1    3.4
12   2    3.3
11   3    2.9
10   4    2.8
9    5    2.5
8    6    2.3
7    7    2.3
6    6    2.1
5    5    1.5
4    4    1.9
3    3    1.6
2    2    1.2
1    1    0.6

Largest = max = 7.9

Boxplot: display of 5-number summary

BOXPLOT

[Figure: boxplot of the Disease X data; axis: Years until death, ticks 0 to 8]

Interquartile range: Q3 – Q1 = 4.2 − 2.3 = 1.9

1.5 * IQR = 1.5*1.9 = 2.85

Q3 + 1.5*IQR = 4.2 + 2.85 = 7.05

Individual #25 has a value of 7.9 years, so 7.9 is an outlier. The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05.
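The fence arithmetic above can be reproduced with a small Python sketch. The function name `boxplot_parts` and the dictionary keys are hypothetical, and Q1 = 2.3 and Q3 = 4.2 are taken from the slide rather than recomputed.

```python
# The 25 Disease X values from this slide (largest value is 7.9).
disease_x = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8,
    2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5,
    4.7, 4.9, 5.3, 6.1, 7.9,
]


def boxplot_parts(values, q1, q3):
    """Compute fences, whisker ends, and outliers from given quartiles."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme data values inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Anything beyond a fence is plotted individually as an outlier.
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


parts = boxplot_parts(disease_x, q1=2.3, q3=4.2)
print(round(parts["upper_fence"], 2))  # 7.05
print(parts["whisker_high"])           # 6.1
print(parts["outliers"])               # [7.9]
```

The fences are rounded only for display, because floating-point arithmetic on values such as 4.2 and 2.3 is not exact.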

Page 34: 2.4 Describing Distributions Numerically

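Beg. of class pulses (n=138)

Q1 = 63, Q3 = 78

IQR = 78 – 63 = 15

1.5(IQR)=1.5(15)=22.5

Q1 - 1.5(IQR): 63 – 22.5=40.5

Q3 + 1.5(IQR): 78 + 22.5=100.5

[Figure: boxplot of the pulse rates, labeled 45 (minimum), 63 (Q1), 70 (median), 78 (Q3), with fences at 40.5 and 100.5]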

Page 35: 2.4 Describing Distributions Numerically

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?

[Figure: boxplot titled "Pass Catching Yards by Receivers"; axis ticks at 0, 136, 273, 410, 547, 684, 821, 958, 1095, 1232, 1369]

1. 450

2. 750

3. 215

4. 545

Page 36: 2.4 Describing Distributions Numerically

Careful! Boxplots Do NOT Show Gaps in the Data

Do not rely solely on a boxplot for data exploration

The boxplots are all the same, but the histograms differ.

Page 37: 2.4 Describing Distributions Numerically

Automating Boxplot Construction

Excel “out of the box” does not draw boxplots.

Many add-ins are available on the internet that give Excel the capability to draw box plots.

SAS, JMP, Minitab, R, etc. all make boxplots (learning curve)

Statcrunch (http://statcrunch.stat.ncsu.edu) makes box plots (no learning curve).

Page 38: 2.4 Describing Distributions Numerically

ATM Withdrawals by Day, Month, Holidays

Page 39: 2.4 Describing Distributions Numerically

Tuition 4-yr Colleges
