Source: baek.math.umbc.edu/stat355/ch1.pdf


Chapter 1 Overview and Descriptive Statistics

Seungchul Baek

STAT 355 Introduction to Probability and Statistics for Scientists and Engineers


Have you ever learned statistics?

Learned, and still remember some

Learned, but have forgotten everything

Heard of it, but never learned it

What is statistics?


What Is Statistics?

Statistics measures uncertainty in real life.

Statistics is the science of data: how to interpret data, analyze data, and design studies to collect data.

Statistics is used in all disciplines, not just in engineering and the sciences.


Statistics Examples

In a reliability (time to failure) study, engineers are interested in describing the time until failure for an electronic device.

In an agricultural experiment, researchers want to know which of four fertilizers produces the highest corn yield.

In a clinical trial, physicians want to determine which of two drugs is more effective for treating HIV in the early stages of the disease.

In a social network analysis, researchers want to know the group patterns among all the users.


What Statisticians Do Is. . .

Statisticians use their skills in mathematics and computing to formulate statistical models and analyze data for a specific problem at hand.

Models are then used to estimate important quantities of interest, to test the validity of proposed conjectures, and to predict future behavior.

Being able to identify and model sources of variability is an important part of statistics.


Definitions

Subject: entities that we measure in a study

Population: the total set of subjects in which we are interested

Sample: the subset of the population for whom we have data, often randomly selected

Variable: any characteristic that is observed for the subject

Statistic: numerical summary of a sample (known to us), e.g. mean, median, etc.

Parameter: numerical summary of a population (typically unknown to us), e.g. mean, median, variance, etc.


Example

Old McDonald’s farm has 5000 turkeys and we’re interested in estimating the average weight of all the turkeys. Instead of weighing all 5000, we only weigh 100 randomly selected turkeys.

What are the subject, population, sample, and variable in this example?
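The difference between a parameter and a statistic in the turkey example can be sketched with simulated data. The slide gives no weights, so the values below are invented purely for illustration; the names `population` and `sample`, the assumed mean of 9 kg, and the spread of 1.5 kg are all hypothetical.

```python
import random
import statistics

random.seed(355)  # fixed seed so the sketch is reproducible

# Hypothetical population: 5000 turkey weights in kg (simulated, not real data).
population = [random.gauss(9.0, 1.5) for _ in range(5000)]

# Sample: 100 turkeys chosen at random, without replacement.
sample = random.sample(population, 100)

# Parameter: numerical summary of the whole population (normally unknown).
parameter = statistics.mean(population)

# Statistic: numerical summary of the sample (what we actually compute).
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f} kg")
print(f"sample mean (statistic):     {statistic:.2f} kg")
```

Here each turkey is a subject, weight is the variable, and the sample mean serves as an estimate of the population mean.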


Example

Last semester there were 243 STAT 355 students. We wanted to approximate the average height of a STAT 355 student, so we looked at 40 students and measured their height. The average height of the 40 students was 165 cm. After that, we found the mandatory physical-exam records of all students, which showed that the average height of all 243 STAT 355 students was 172 cm.

What are the subject, population, sample, variable, statistic, and parameter?


Major Components to Statistics

Descriptive Statistics

What summary can help us answer the question?

Inferential Statistics (or Statistical Inference)

Can we predict or draw conclusions based on the data we have?


Types of Variables

Variable: any characteristic that is observed for the subject. There are two types of variables: categorical and quantitative.

Categorical: Observations that belong to a set of categories.

ex. hair color, gender, zip code, etc.

Quantitative: Observations that take on numerical values

ex. height, weight, income, etc.


Types of Variables

Quantitative: Observations that take on numerical values

Discrete: measured by a whole number

ex. number of books, children, money, etc

Continuous: measured on an interval

ex. time, weight, distances


How to Compare Discrete and Continuous

If you think of time: going from 1 min to 2 min, we have to pass through all of the times in between, e.g. 1.5 min (1 min 30 sec).

If you think of weight: going from 150 lbs to 140 lbs, we have to pass through every weight between 140 and 150, e.g. 144 lbs.

If you think of the number of books or children, we jump from one number to the next; 2.5 books or 1.5 children means nothing.

Time and weight are continuous variables. Books and children are discrete variables.


Example

Let’s consider a random sample of five residents of Ellicott City.

   Days  Piercings  Gym  Type     Age  Gender
1  2     0          No   Neither  46   F
2  3     1          Yes  Run      21   F
3  1     0          Yes  Run      64   M
4  6     2          Yes  Both     18   M
5  0     0          No   Neither  19   F

Days: Number of days spent on workout weekly

Piercings: Number of body piercings

Gym: Do they go to the gym or not?

Type: Do they lift, run, neither or both?


Example

Which variables are categorical?

Which variables are quantitative (discrete)?

Which variables are quantitative (continuous)?


Categorical Summary: Frequency Table

Let’s say we had 160 people in our sample instead of the 5 in the previous example and we want to get a better look at the type of workout that a resident of Ellicott City does.

   Type     Frequency
1  Lift     32
2  Run      64
3  Both     16
4  Neither  48
5  Total    160


Categorical Summary: Frequency Table

   Type     Frequency  Relative.Frequency
1  Lift     32         0.2
2  Run      64         0.4
3  Both     16         0.1
4  Neither  48         0.3
5  Total    160        1

The relative frequency is the proportion of the total sample (here, 160) that falls in the category we’re looking at.

$$\text{Relative Frequency} = \frac{\#\text{ of subjects in each case}}{\text{total } \#\text{ of subjects}}$$
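As a minimal sketch, the relative-frequency column can be reproduced from the frequency counts in the table. The dictionary name `frequency` and the print layout are illustrative choices, not part of the slides.

```python
# Frequencies copied from the slide's table.
frequency = {"Lift": 32, "Run": 64, "Both": 16, "Neither": 48}

total = sum(frequency.values())  # total number of subjects

# Relative frequency = count in the category / total count.
relative_frequency = {kind: count / total for kind, count in frequency.items()}

for kind, count in frequency.items():
    print(f"{kind:<8}{count:>5}{relative_frequency[kind]:>6.1f}")
print(f"{'Total':<8}{total:>5}{sum(relative_frequency.values()):>6.1f}")
```

The relative frequencies always sum to 1, which is a quick check on the arithmetic.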


Graphs

Stem-and-Leaf Plot: Okay when the data set is small; retains actual data values

Dot Plot: Okay when the data set is small and there are relatively few distinct data values

Pie Chart: Useful when there are a small number of categories

Bar Graph: Useful when there are many categories of the variable, and useful to compare groups

Histograms: Good for large data and for showing the shape of distribution


Shape of Distributions

Symmetric if the right and left sides of the histogram are approximately mirror images of each other

Skewed to the right (positively skewed) if the right “tail” extends much farther out than the left tail

Skewed to the left (negatively skewed) if the left “tail” extends much farther out than the right tail

Uniform if all bars are the same height

Bimodal if the histogram has two distinct peaks, i.e. two bars that stand higher than the bars around them


Measures of Central Tendency: Mean

The sample mean $\bar{X}$ of observations $x_1, \ldots, x_n$ is given by

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$


Measures of Central Tendency: Median

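The median $\tilde{X}$ is the midpoint of the observations when they are ordered from smallest to largest.

$$\tilde{X} = \begin{cases} X_{(m)}, & \text{if } n \text{ is odd} \\ \dfrac{X_{(m)} + X_{(m+1)}}{2}, & \text{if } n \text{ is even,} \end{cases}$$

where $m = (n + 1)/2$ when $n$ is odd and $m = n/2$ when $n$ is even. $X_{(m)}$ stands for the $m$-th observation in the ordered list.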


Measures of Central Tendency: Mode

The mode is the observation that shows up the most in the data set. A mode doesn’t necessarily exist when there is a tie for the most frequent value.


Example

We have a data set of size 14.

2, 7, 7, 11, 12, 15, 14, 20, 5, 6, 15, 12, 12, 20

Mean?

Median?

Mode?
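One way to check hand calculations for this example is the short sketch below, which applies the slide formulas directly. The variable names are illustrative, and reporting every value tied for the highest count as a mode is a choice made here rather than something the slides prescribe.

```python
from collections import Counter

data = [2, 7, 7, 11, 12, 15, 14, 20, 5, 6, 15, 12, 12, 20]
n = len(data)

# Mean: sum of the observations divided by n.
mean = sum(data) / n

# Median: middle of the ordered observations (average of the two middle
# values when n is even). Indices are shifted by one because Python lists
# start at 0 while the slide's X(m) starts at 1.
ordered = sorted(data)
if n % 2 == 1:
    m = (n + 1) // 2
    median = ordered[m - 1]
else:
    m = n // 2
    median = (ordered[m - 1] + ordered[m]) / 2

# Mode(s): every value that attains the highest count.
counts = Counter(data)
highest = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == highest)

print(f"mean = {mean:.2f}, median = {median}, modes = {modes}")
```

For this data set the sum is 158, so the mean is 158/14 (about 11.29); the two middle ordered values are both 12, so the median is 12; and 12 appears three times, more than any other value, so it is the mode.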


Measures of Variability: Range

The range is the difference between the maximum and minimum observations.

It is easy to calculate but relies on only two values, which may be outliers.

$$\text{Range} = \max_i (x_i) - \min_i (x_i)$$


Measures of Variability: Variance

The sample variance $s^2$ is, roughly, the average squared deviation of the observations from the mean (the sum of squared deviations is divided by $n - 1$ rather than $n$).

The idea is that it measures the spread of the data about the mean.

It is difficult to interpret because it is in squared units. It cannot be negative, and it is zero only when all data points are equal.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$

The sample standard deviation $s$ is the positive square root of the variance,

$$s = \sqrt{s^2}$$
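As a small sketch, the definitions of the sample variance and standard deviation can be applied directly and compared with Python's `statistics` module, which also divides by n - 1. The data set is borrowed from the earlier mean/median/mode example purely for illustration.

```python
import math
import statistics

data = [2, 7, 7, 11, 12, 15, 14, 20, 5, 6, 15, 12, 12, 20]
n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

# Sample standard deviation: positive square root of the variance.
s = math.sqrt(s2)

print(f"s^2 = {s2:.4f}, s = {s:.4f}")

# Cross-check against the standard library's sample variance and stdev.
print(math.isclose(s2, statistics.variance(data)))
print(math.isclose(s, statistics.stdev(data)))
```

Unlike the variance, the standard deviation is in the same units as the data, which usually makes it easier to interpret.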


Computing Formula for $s^2$

It is not hard to show that

$$\sum_{i=1}^{n} (x_i - \bar{X})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{X}^2$$

We will encounter a similar identity later:

$$\operatorname{var}(X) = E\{X - E(X)\}^2 = E(X^2) - \{E(X)\}^2$$
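The computing formula for the sum of squared deviations can be verified by expanding the square and using $\sum_{i=1}^{n} x_i = n\bar{X}$. The steps below are a standard derivation written out here for completeness; they are not taken from the slides.

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{X})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{X} \sum_{i=1}^{n} x_i + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{X}\,(n\bar{X}) + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{X}^2
\end{aligned}
```

Dividing both sides by $n - 1$ gives $s^2$ without having to compute each deviation separately.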


Proposition

We have a sample $x_1, \ldots, x_n$ and let $c$ be a constant.

If $y_i = x_i + c$ for $i = 1, \ldots, n$, then $s_y^2 = s_x^2$.

If $y_i = c x_i$ for $i = 1, \ldots, n$, then $s_y^2 = c^2 s_x^2$.

These follow from the properties below, which we will learn in later chapters.

$$\operatorname{var}(X + c) = \operatorname{var}(X)$$

$$\operatorname{var}(cX) = c^2 \operatorname{var}(X)$$
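The proposition can be checked numerically with a short sketch. The data set is reused from an earlier example and the constant `c = 3` is an arbitrary, hypothetical choice; a numerical check on one sample illustrates the result but does not prove it.

```python
import math
import statistics

x = [2, 7, 7, 11, 12, 15, 14, 20, 5, 6, 15, 12, 12, 20]
c = 3  # hypothetical constant chosen for illustration

shifted = [xi + c for xi in x]  # y_i = x_i + c
scaled = [c * xi for xi in x]   # y_i = c * x_i

s2_x = statistics.variance(x)
s2_shifted = statistics.variance(shifted)
s2_scaled = statistics.variance(scaled)

# Adding a constant leaves the sample variance unchanged.
print(math.isclose(s2_shifted, s2_x))

# Multiplying by c multiplies the sample variance by c^2.
print(math.isclose(s2_scaled, c**2 * s2_x))
```

Intuitively, shifting every observation moves the mean by the same amount, so the deviations from the mean do not change; scaling multiplies every deviation by c, so the squared deviations are multiplied by c².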


Example

We have the following data:

0.2, 0.7, 1.1, 1.2, 1.8, 2.3, 9.8, 19.7

What kind of data type is this?

Draw a dot plot.

Draw a stem-and-leaf plot.

Draw a histogram.

Mean, median, and mode?

Range, variance, and standard deviation?
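The numerical parts of this exercise can be checked with the sketch below, which relies on Python's standard `statistics` module (its `variance` and `stdev` divide by n - 1, matching the definition of $s^2$). The variable names are illustrative.

```python
import statistics

data = [0.2, 0.7, 1.1, 1.2, 1.8, 2.3, 9.8, 19.7]

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)    # every value tied for the highest count
data_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

print(f"mean     = {mean:.2f}")
print(f"median   = {median:.2f}")
print(f"range    = {data_range:.2f}")
print(f"variance = {variance:.2f}")
print(f"std dev  = {std_dev:.2f}")
print(f"values tied for most frequent: {len(modes)}")
```

The mean (4.6) sits well above the median (1.5) because the two large values, 9.8 and 19.7, pull it upward. Every value occurs exactly once, so all eight values tie and no single mode stands out.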


Percentiles

Percentile: the p-th percentile is a value such that p percent of the observations fall at or below the value.

Consider an ordered population of 10 data values

3, 6, 7, 8, 8, 10, 13, 15, 16, 20

What are the 70th and 15th percentiles?

70th percentile = (0.7 * 10)th position = 7th position = 13

15th percentile = (0.15 * 10)th position = 1.5th position, which is rounded up to the 2nd position = 6
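The position rule used in this example can be sketched as a small function. This follows the slide's convention (take p percent of n and round up when the position is not a whole number); statistical software often uses other conventions that interpolate between neighbouring values, so results may differ. The function name `percentile` and the restriction to integer `p` from 1 to 100 are assumptions made for this sketch.

```python
def percentile(ordered_values, p):
    """Return the p-th percentile using the slide's position rule.

    The position is p percent of n, rounded up to the next whole position
    when it is not an integer. `p` is an integer between 1 and 100.
    """
    n = len(ordered_values)
    # Integer ceiling of p * n / 100, which avoids floating-point round-off.
    position = -(-p * n // 100)
    return ordered_values[position - 1]  # positions are 1-based


data = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]  # already ordered

print(percentile(data, 70))  # 13
print(percentile(data, 15))  # 6
```

Integer arithmetic is used for the position because an expression such as `0.7 * 10` is not exactly 7 in floating point, and rounding it up would select the wrong observation.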


Percentile and Quartile

The term quartile is used because one will be able to divide the data into quarters

Q1: the observation at the 25th percentile

Q2: the observation at the 50th percentile (Median)

Q3: the observation at the 75th percentile

IQR (interquartile range) = Q3 - Q1: another measure of variability.


Example

We have two samples, one of which is 3, 6, 7, 8, 8, 10, 13, 15, 16, 20 and the other is 3, 6, 7, 8, 8, 10, 13, 15, 16, 20, 40.

[Figure: side-by-side plot of samples 1 and 2; axis ticks at 10, 20, 30, 40]
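The two samples can be compared through the quantities a boxplot is usually drawn from: minimum, Q1, median, Q3, and maximum. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions and is an assumption here; the slides do not say which convention they use, so the quartiles may differ slightly from hand-computed ones. The helper name `five_number_summary` is hypothetical.

```python
import statistics

sample_1 = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
sample_2 = sample_1 + [40]  # same values plus one large observation


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # method="inclusive" is one of several quartile conventions (an assumption).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


for name, values in [("sample 1", sample_1), ("sample 2", sample_2)]:
    low, q1, q2, q3, high = five_number_summary(values)
    print(f"{name}: min={low} Q1={q1} median={q2} Q3={q3} max={high} "
          f"IQR={q3 - q1} range={high - low}")
```

Adding the single value 40 more than doubles the range (from 17 to 37), while the median and IQR move only a little under this convention, which is why the IQR is the more robust measure of variability.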


Shape of Distributions and Boxplots

Bell shape

Skewed to the right

Skewed to the left


Location of Mean, Median, and Mode

Bell shape

Skewed to the right

Skewed to the left
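As a rule of thumb (with exceptions), the mean is pulled toward the longer tail: it tends to sit above the median in right-skewed data and below it in left-skewed data, while the two roughly coincide for a bell shape. The sketch below illustrates this with the small right-skewed sample from the earlier example and its mirror image; the mirrored sample is constructed here purely for illustration.

```python
import statistics

# Right-skewed sample reused from the earlier example in these slides.
right_skewed = [0.2, 0.7, 1.1, 1.2, 1.8, 2.3, 9.8, 19.7]

# Mirror image of the same sample, which is therefore left-skewed.
left_skewed = [20 - x for x in right_skewed]

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    relation = ">" if mean > median else "<"
    print(f"{name}: mean {mean:.2f} {relation} median {median:.2f}")
```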
