Unit 2: Some Basics

Upload: gabriel-hodges

Post on 20-Jan-2018



TRANSCRIPT

Page 1: Unit 2: Some Basics. The whole vs. the part population vs. sample –means (avgs) and “std devs”…

Unit 2: Some Basics


The whole vs. the part
• population vs. sample
– means (avgs) and “std devs” [defined later] of these are denoted by different letters
• description vs. inference
– mean of a population describes it (partly)
– mean of a sample gives a basis to infer (guess) the mean of the population


Computers are ubiquitous in stats

• Warning: We’ll compute with Excel because you can see the data and it’s available all over campus...

• but its stats results are unreliable (see next slide)

• In the real world, use a statistical program (SAS, SPSS, ...)


Data Types
• categorical vs. numerical (vs. ordinal, …)
– ex. of categorical: red, blue, green
– ex. of numerical: person’s height in inches
– (ex. of ordinal: strongly agree, agree, neutral, ...)
• [numerical: discrete vs. continuous]

Types of Statistics
• graphical
• numerical


Graphs for categorical data

• pie charts: for fractions of total data in each category

• bar (Excel: “column”) graphs: for counts or fractions of data in each category

• (mode = most frequent value)


Example: Hair color at NYS Fair


Numerical variables: Graphs of frequencies I
• dot plots
• stem-and-leaf plots
• Ex: 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92
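A stem-and-leaf plot of the example list can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `stem_and_leaf` is made up here, and it assumes two-digit whole numbers, with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text row per stem: tens digit | sorted units digits."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 79 -> stem 7, leaf 9
        leaves[stem].append(leaf)
    rows = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]
for row in stem_and_leaf(scores):
    print(row)
```

Turned on its side, the rows look like a dot plot with the leaf digits in place of dots.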


Numerical variables: Graphs of frequencies II

• histogram
• 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92
• (Note similarity to dotplot and stem-and-leaf plot)
• Excel’s “histograms” aren’t true histograms unless bar widths are equal


More on Histograms

• Frequencies are represented by areas, not heights

• Vertical axis should have “density scale” (in % per horizontal unit, not frequency)

• So (if widths differ), bar height is % per horizontal unit (% / h.u.)
• Result: total area under histogram = 100%


Density scale

• Weekly salaries in a company
• Vertical axis is in % per $200
• Where is the high point?


Reading a histogram: Hrs slept by CU students

(Data from questionnaire)

Hrs      Area (width x height)   % of total
1,2,3    3x1 = 3                 3/20 = 15%
4,5      2x3 = 6                 6/20 = 30%
6        1x4 = 4                 4/20 = 20%
7        1x3 = 3                 3/20 = 15%
8,9      2x2 = 4                 4/20 = 20%
Sum      20

So maybe 5% slept 1 hr, 5% 2 hr and 5% 3 hr; or maybe 0% slept 1 hr, 7% 2 hr, 8% 3 hr.
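The area bookkeeping in the table can be reproduced with a short Python sketch. The widths and heights below are the ones read off the table; the variable names are illustrative only.

```python
# (interval label, width in hours, bar height) as read off the table
bars = [("1-3", 3, 1), ("4-5", 2, 3), ("6", 1, 4), ("7", 1, 3), ("8-9", 2, 2)]

# Frequency is represented by area, so multiply width by height.
areas = {label: width * height for label, width, height in bars}
total_area = sum(areas.values())

# Each interval's share of the students is its share of the total area.
shares = {label: 100 * area / total_area for label, area in areas.items()}
for label, pct in shares.items():
    print(f"{label} hr: {pct:.0f}%")
```

The histogram gives only the share for a whole interval. How that share splits among the individual hours inside a wide bar is not recoverable, which is why the 15% for hours 1 to 3 could be 5/5/5 or 0/7/8.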

1. Which given interval contains the most students?

2. Which 1-hr period contains the most students (i.e., is most “crowded”)?

3. About what % slept 8 hr?

4. About what % slept 3 or 4 hr?

5. If there were 240 surveyed, about how many slept 6-7 hr?


Drawing a histogram: Horses’ weights in kg

Wts (kg)   Width   Count   %    % per 10 kg
180-200    20      20      8    4
200-220    20      80      32   16
220-230    10      30      12   12
230-240    10      25      10   10
240-260    20      40      16   8
260-280    20      25      10   5
280-310    30      30      12   4
Sum                250
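The last two columns of the horse table follow mechanically from the counts and widths. Here is a minimal Python sketch of that calculation; the function name `density_rows` is invented for this example.

```python
# (low kg, high kg, count) for each weight interval, copied from the table
intervals = [
    (180, 200, 20), (200, 220, 80), (220, 230, 30), (230, 240, 25),
    (240, 260, 40), (260, 280, 25), (280, 310, 30),
]

total = sum(count for _, _, count in intervals)

def density_rows(intervals, total, unit=10):
    """Return (low, high, percent, percent per `unit` kg) for each interval."""
    rows = []
    for low, high, count in intervals:
        percent = 100 * count / total
        height = percent / ((high - low) / unit)  # bar height on the density scale
        rows.append((low, high, percent, height))
    return rows

rows = density_rows(intervals, total)
for low, high, percent, height in rows:
    print(f"{low}-{high} kg: {percent:.0f}% of horses, height {height:.0f}% per 10 kg")

# Area check: height times width (in 10-kg units), summed over all bars
total_area = sum(height * (high - low) / 10 for low, high, _, height in rows)
print(f"total area = {total_area:.0f}%")
```

Dividing by the width is what keeps the 30-kg-wide last bar from looking taller than it should: it holds 12% of the horses but is drawn at a height of only 4% per 10 kg.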


Frequencies by area?


We are given lists of numbers taken from a large number of residents of New York State. The numbers represent

1. the person’s height
2. the distance from the person’s home to the nearest airport
3. the distance from the person’s home to Syracuse
4. the size of the person’s savings

From the following histograms, choose the one that you think best approximates the true histogram of each list.


Remarks on histograms

• If data is continuous, we need a convention about what to do with data on boundaries: Does it fall into the left or the right interval?

• If data is discrete, it’s natural to center bars on values (which Excel does even when it shouldn’t).


Possible traits of a number list
• “skewed” right or left (the skew is the tail)
– incomes are usually skewed right
– test scores are often skewed left
– IQs are usually symmetric
• outlier values (far from others – sometimes a rule defines one, but ...)
• unimodal vs. bimodal
– ex: heights of a group of men vs. heights of a group of men and women


Centers of numerical data
• average [mean] (Greek letter µ, or variable with overbar, $\bar{x}$): sum of data divided by number of numbers in list
– this is the “arithmetic” mean (others include geometric, harmonic)
– “center of gravity” of histogram
• median: value that is greater than half, less than half of data
– half area of histogram to each side
– more “resistant” to skew & outliers


Weekly salaries in a small factory: 30 workers

17 earn $200, 5 earn $400, 6 earn $500, 1 earns $2000, 1 earns $4600

avg = $500, median = $200
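The two figures can be checked with Python's standard `statistics` module. This is a small sketch; the dictionary layout is just one convenient way to enter the 30 salaries.

```python
from statistics import mean, median

# weekly salary in dollars -> number of workers earning it
pay_counts = {200: 17, 400: 5, 500: 6, 2000: 1, 4600: 1}

# Expand the counts into the full list of 30 salaries.
salaries = [pay for pay, workers in pay_counts.items() for _ in range(workers)]

print("workers:", len(salaries))
print("avg:", mean(salaries))
print("median:", median(salaries))
```

The two top salaries pull the average up to $500, while the median stays at $200 because more than half of the workers earn exactly that.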


Test for skew
• median < avg : skewed to right
• median > avg : skewed to left
– (These can be used as the def of skew left or right)
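The comparison is easy to automate. The helper below is a hypothetical illustration (the name `skew_direction` is not from the slides), applied to the factory salaries and to the 15 scores used in the earlier graph examples.

```python
from statistics import mean, median

def skew_direction(values):
    """Apply the median-vs-average test for skew."""
    avg, med = mean(values), median(values)
    if med < avg:
        return "right"
    if med > avg:
        return "left"
    return "no skew indicated"

salaries = [200] * 17 + [400] * 5 + [500] * 6 + [2000, 4600]
scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]

print("salaries:", skew_direction(salaries))
print("scores:", skew_direction(scores))
```

For the scores, the median is 87 and the average is 78.2, so the list counts as skewed left under this test.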


Pop Quiz

Is the average or the median likely to be greater for a list of

• heights of people (including children)?
• heights of adults?
• salaries?


Measures of spread
• standard deviation σ (maybe with subscript of variable): $\sigma = \sqrt{\sum (x - \bar{x})^2 / n}$
– this is “population std dev”, SD in text, stdevp in Excel
– “sample std dev” $s = \sqrt{\sum (x - \bar{x})^2 / (n-1)}$ is SD+ in text (much later), stdev in Excel
• IQR (“inter-quartile range”): difference between third and first quartiles
– first quartile: 1/4 of data is less, 3/4 greater
– third quartile: [Guess!]
– (n-th percentile: n% of data is less, (100-n)% greater)
• (range: max value – min value)
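These measures are all available in Python's standard `statistics` module: `pstdev` divides by n and `stdev` divides by n-1. The sketch below uses the 15 scores from the earlier graph examples. Quartile conventions differ between programs; `method="inclusive"` is chosen here on the assumption that it is the interpolation rule closest to Excel's.

```python
from statistics import pstdev, stdev, quantiles

scores = [79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92]

sigma = pstdev(scores)   # population std dev: divide by n
s = stdev(scores)        # sample std dev: divide by n - 1

# Quartiles by linear interpolation between data values
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

value_range = max(scores) - min(scores)

print(f"sigma = {sigma:.2f}, s = {s:.2f}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"range = {value_range}")
```

Because s divides by the smaller number n-1, it is always a little larger than σ for the same list, and the gap shrinks as the list gets longer.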


Rough estimates of σ
• Steps: Estimate avg, estimate deviations of data values from avg, estimate avg of deviations (guess high)

• Ex: 41, 48, 50, 50, 54, 57. Is σ closest to 0, 5, 10 or 20?

• Ex: Is σ closest to 0, 3, 8 or 15?
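For the first example (41, 48, 50, 50, 54, 57), the rough steps can be compared with the exact value in Python. This is a sketch for checking an estimate afterwards. The plain average of the absolute deviations is never larger than σ, which is one way to read the “guess high” advice.

```python
from statistics import mean, pstdev

data = [41, 48, 50, 50, 54, 57]
candidates = [0, 5, 10, 20]

avg = mean(data)
# Rough step: average size of the deviations from the average
avg_abs_dev = mean([abs(x - avg) for x in data])
# Exact population standard deviation
sigma = pstdev(data)

closest = min(candidates, key=lambda c: abs(c - sigma))
print(f"avg = {avg}, avg deviation = {avg_abs_dev:.2f}, sigma = {sigma:.2f}")
print("closest candidate:", closest)
```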


Rule of Thumb (best for, but not limited to, bell-shaped data)

• 68% of values are within 1σ of mean
• 95% of values are within 2σ of mean
• almost all values are within 3σ of mean


• So values within one σ of the average are fairly common, ...

• while values around two σ away from average are mildly surprising, ...

• and values more than three σ away from the average are “a minor miracle”.

• “z-score” (“std units”): $z = (x - \bar{x}) / \sigma$
– the number of σ’s above average
– (if negative, below average)
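A z-score is a one-line calculation. Here is a minimal Python sketch using the population standard deviation and the list 41, 48, 50, 50, 54, 57 from the earlier estimation example; the function name `z_score` is illustrative.

```python
from statistics import mean, pstdev

def z_score(x, values):
    """Number of standard deviations that x lies above the average of values."""
    return (x - mean(values)) / pstdev(values)

data = [41, 48, 50, 50, 54, 57]   # average 50, sigma 5
for x in (41, 50, 57):
    print(f"x = {x}: z = {z_score(x, data):+.1f}")
```

On this list, 57 is 1.4 σ above average and 41 is 1.8 σ below it, so both are more than one σ from the average but less than two.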


Effect of outliers

• 1, 2, 3, 3, 4, 5
– avg = 3
– $\sigma = \sqrt{10/6} \approx 1.3$
– median = 3
– IQR = 3.75 - 2.25 = 1.5 (Excel values)
• 1, 2, 3, 3, 4, 100
– avg ≈ 18.8
– σ ≈ 36.3
– median = 3
– IQR = 3.75 - 2.25 = 1.5 (Excel values)
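The comparison can be reproduced in Python. In this sketch, the `method="inclusive"` quartile rule of `statistics.quantiles` is used because it yields the same 2.25 and 3.75 as the Excel values quoted on the slide; other quartile conventions give slightly different numbers. The function name `summarize` is illustrative.

```python
from statistics import mean, median, pstdev, quantiles

def summarize(values):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "avg": mean(values),
        "sigma": pstdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

without_outlier = summarize([1, 2, 3, 3, 4, 5])
with_outlier = summarize([1, 2, 3, 3, 4, 100])

for name, stats in (("1..5", without_outlier), ("1..100", with_outlier)):
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

Replacing 5 by 100 moves the average and σ a long way, while the median and IQR do not change at all.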


Changing the list (I): “Linear” changes of variable (i.e., changing units of measure)
• Basic list: 6, 9, 15
– $\mu = (6+9+15)/3 = 10$
– $\sigma = \sqrt{[(6-10)^2+(9-10)^2+(15-10)^2]/3} = \sqrt{14}$
• Add 5: 11, 14, 20
– $\mu = (11+14+20)/3 = 15$
– $\sigma = \sqrt{[(11-15)^2+(14-15)^2+(20-15)^2]/3} = \sqrt{14}$
• Multiply by 12: 72, 108, 180
– $\mu = (72+108+180)/3 = 120$
– $\sigma = \sqrt{[(72-120)^2+(108-120)^2+(180-120)^2]/3} = 12\sqrt{14}$
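The pattern in these three lists (adding a constant shifts µ but leaves σ alone; multiplying by a constant multiplies both) can be checked numerically. This is a short Python sketch using the same numbers.

```python
from math import sqrt
from statistics import mean, pstdev

base = [6, 9, 15]
shifted = [x + 5 for x in base]    # add 5 to every value
scaled = [12 * x for x in base]    # multiply every value by 12

for name, values in (("base", base), ("add 5", shifted), ("times 12", scaled)):
    print(f"{name}: mu = {mean(values)}, sigma = {pstdev(values):.3f}")

print(f"sqrt(14) = {sqrt(14):.3f}, 12*sqrt(14) = {12 * sqrt(14):.3f}")
```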


Changing the list (II): Combining two lists
• A: 25, 35   B: 40, 40, 50, 50
– $\mu_A = 30$, $\sigma_A = 5$, $\mu_B = 45$, $\sigma_B = 5$
• A and B: 25, 35, 40, 40, 50, 50
– $\mu = 40$, $\sigma = 5\sqrt{3} > 5$
• Adjoining copies of the average: 6, 9, 15, 10, 10, 10, 10
– $\mu = (6+9+15+4(10))/7 = 10$
– $\sigma = \sqrt{[(6-10)^2+(9-10)^2+(15-10)^2+4(10-10)^2]/7} = \sqrt{3/7}\cdot\sqrt{14} < \sqrt{14}$
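Both effects can be confirmed with a few lines of Python, sketched here with the same numbers. Merging two lists with different averages can raise σ above either list's own σ, while padding a list with copies of its average leaves µ alone and lowers σ.

```python
from math import sqrt
from statistics import mean, pstdev

a = [25, 35]
b = [40, 40, 50, 50]
combined = a + b

print(f"A: mu = {mean(a)}, sigma = {pstdev(a):.2f}")
print(f"B: mu = {mean(b)}, sigma = {pstdev(b):.2f}")
print(f"A and B: mu = {mean(combined)}, sigma = {pstdev(combined):.2f}")

base = [6, 9, 15]
padded = base + [10] * 4          # adjoin four copies of the average
print(f"base: sigma = {pstdev(base):.3f}; padded: sigma = {pstdev(padded):.3f}")
```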


Box-&-whisker plots
• Uses all of min, first quartile, median, third quartile, max
• “Normal” just shows them all
• “Modified” puts a limit on whisker length: 1.5 IQR
– 3rd quartile + 1.5 IQR = “(inner) fence”
  • whisker ends at last value before or on fence
– beyond the fence is an “outlier”
  • reject only “for cause” (?)
– beyond 3rd quartile + 3 IQR (“outer fence” or “bound”) is “extreme outlier”
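The fence rules for a modified box plot can be sketched in Python. Two assumptions are made that go beyond the slide: the lower fences are placed symmetrically at 1st quartile - 1.5 IQR and 1st quartile - 3 IQR, and quartiles are computed with the `method="inclusive"` rule of `statistics.quantiles`. The function names are illustrative.

```python
from statistics import quantiles

def fences(values):
    """Return (lower, upper) pairs for the inner and outer fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),  # lower fence is an assumption
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),      # lower fence is an assumption
    }

def classify(values):
    """Find whisker ends, outliers and extreme outliers for a modified box plot."""
    f = fences(values)
    (in_lo, in_hi), (out_lo, out_hi) = f["inner"], f["outer"]
    inside = [x for x in values if in_lo <= x <= in_hi]
    return {
        # whiskers end at the last values on or inside the inner fences
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in values
                     if (out_lo <= x < in_lo) or (in_hi < x <= out_hi)],
        "extreme": [x for x in values if x < out_lo or x > out_hi],
    }

print(fences([1, 2, 3, 3, 4, 100]))
print(classify([1, 2, 3, 3, 4, 100]))
print(classify([1, 2, 3, 3, 4, 7]))
```

With quartiles 2.25 and 3.75, the upper fences fall at 6 and 8.25, so 100 is an extreme outlier, whereas a 7 in its place would be an ordinary outlier.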


Normal:

Modified: