1/2/2014 (c) 2000, ron s. kenett, ph.d.1 understanding variability instructor: ron s. kenett email:...

Post on 26-Mar-2015


04/10/23

(c) 2000, Ron S. Kenett, Ph.D. 1

Understanding Variability

Instructor: Ron S. Kenett

Email: ron@kpa.co.il

Course Website: www.kpa.co.il/biostat

Course textbook: MODERN INDUSTRIAL STATISTICS, Kenett and Zacks, Duxbury Press, 1998

Course Syllabus

• Understanding Variability
• Variability in Several Dimensions
• Basic Models of Probability
• Sampling for Estimation of Population Quantities
• Parametric Statistical Inference
• Computer Intensive Techniques
• Multiple Linear Regression
• Statistical Process Control
• Design of Experiments

Discrete Data

A set of data is said to be discrete if the values / observations belonging to it are distinct and separate. That is, they can be counted (1, 2, 3, ...). For example, the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Continuous Data

A set of data is said to be continuous if the values / observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example, height; weight; temperature; the amount of sugar in an orange; the time required to run a mile.

Types of Variables

Qualitative Variables
• Attributes, categories
• Examples: male/female, registered to vote/not, ethnicity, eye color, ...

Quantitative Variables
• Discrete - usually take on integer values but can take on fractions when variable allows - counts, how many
• Continuous - can take on any value at any point along an interval - measurements, how much

Self Assessment Test

For each of the following, indicate whether the appropriate variable would be qualitative or quantitative.

If the variable is quantitative, indicate whether it would be discrete or continuous.

Self Assessment Test

a) Whether you own an RCA Colortrak television set
Qualitative Variable: two levels (yes/no), no measurement

b) Your status as a full-time or a part-time student
Qualitative Variable: two levels (full/part), no measurement

c) Number of people who attended your school’s graduation last year
Quantitative, Discrete Variable: a countable number, only whole numbers

Self Assessment Test

d) The price of your most recent haircut
Quantitative, Discrete Variable: a countable number, only whole numbers

e) Sam’s travel time from his dorm to the Student Union
Quantitative, Continuous Variable: any number; time is measured and can take on any value greater than zero

Self Assessment Test

f) The number of students on campus who belong to a social fraternity or sorority

Quantitative, Discrete Variable a countable number only whole numbers

Scales of Measurement

Nominal Scale - Labels represent various levels of a categorical variable.

Ordinal Scale - Labels represent an order that indicates either preference or ranking.

Interval Scale - Numerical labels indicate order and distance between elements. There is no absolute zero and multiples of measures are not meaningful.

Ratio Scale - Numerical labels indicate order and distance between elements. There is an absolute zero and multiples of measures are meaningful.

Self Assessment Test

Bill scored 1200 on the Scholastic Aptitude Test and entered college as a physics major. As a freshman, he changed to business because he thought it was more interesting. Because he made the dean’s list last semester, his parents gave him $30 to buy a new Casio calculator. Identify at least one piece of information in the:

Self Assessment Test

a) nominal scale of measurement.

1. Bill is going to college.
2. Bill will buy a Casio calculator.
3. Bill was a physics major.
4. Bill is a business major.
5. Bill was on the dean’s list.

Self Assessment Test

b) ordinal scale of measurement
Bill is a freshman.

c) interval scale of measurement
Bill earned a 1200 on the SAT.

d) ratio scale of measurement
Bill’s parents gave him $30.


Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height.

Key Terms

Data array - An orderly presentation of data in either ascending or descending numerical order.

Frequency Distribution - A table that represents the data in classes and that shows the number of observations in each class.

Key Terms

Frequency Distribution
• Class - The category
• Frequency - Number in each class
• Class limits - Boundaries for each class
• Class interval - Width of each class
• Class mark - Midpoint of each class

Sturges’ Rule

How to set the approximate number of classes to begin constructing a frequency distribution:

\[ k = 1 + 3.322 \log_{10} n \]

where k = approximate number of classes to use and n = the number of observations in the data set.
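A minimal Python sketch of Sturges’ rule, assuming the form k = 1 + 3.322 log10(n). The function name `sturges_classes` and the value n = 50 are illustrative choices, not part of the slides.

```python
import math


def sturges_classes(n: int) -> float:
    """Approximate number of classes, k = 1 + 3.322 * log10(n)."""
    if n < 1:
        raise ValueError("n must be a positive number of observations")
    return 1 + 3.322 * math.log10(n)


# Illustrative value only: a data set with 50 observations.
k = sturges_classes(50)
print(round(k, 3))    # fractional value given by the rule
print(math.ceil(k))   # rounded up to a whole number of classes
```

The rule only gives a starting point; the final number of classes is usually adjusted so that the class width comes out as a convenient value.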

Frequency Distributions

1. Number of classes: Choose an approximate number of classes for your data. Sturges’ rule can help.

2. Estimate the class interval: Divide the approximate number of classes (from Step 1) into the range of your data to find the approximate class interval, where the range is defined as the largest data value minus the smallest data value.

3. Determine the class interval: Round the estimate (from Step 2) to a convenient value.

Frequency Distributions

4. Lower Class Limit: Determine the lower class limit for the first class by selecting a convenient number that is smaller than the lowest data value.

5. Class Limits: Determine the other class limits by repeatedly adding the class width (from Step 3) to the prior class limit, starting with the lower class limit (from Step 4).

6. Define the classes: Use the sequence of class limits to define the classes.
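Steps 4 to 6 can be sketched in Python as follows. This is only an illustration: the helper names `class_limits` and `tally`, the ten data values, the width of 10, and the starting limit of 10 are all made up for the example, and each class is assumed to run from its lower limit up to, but not including, its upper limit.

```python
def class_limits(highest, width, start):
    """Steps 5 and 6: add the class width repeatedly, starting from the first lower limit."""
    limits = []
    lower = start
    while lower <= highest:          # keep going until the largest value is covered
        limits.append((lower, lower + width))
        lower += width
    return limits


def tally(data, limits):
    """Count the observations that fall in each class (lower <= x < upper)."""
    return [sum(1 for x in data if lo <= x < hi) for lo, hi in limits]


# Hypothetical data, not from the slides.
data = [12, 15, 21, 22, 27, 30, 34, 38, 41, 49]
start = 10                           # Step 4: a convenient number below the lowest value
assert start <= min(data)
limits = class_limits(max(data), width=10, start=start)
counts = tally(data, limits)
for (lo, hi), count in zip(limits, counts):
    print(f"{lo} up to {hi}: {count}")
```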

Relative Frequency Distributions

1. Retain the same classes defined in the frequency distribution.

2. Sum the total number of observations across all classes of the frequency distribution.

3. Divide the frequency for each class by the total number of observations, forming the proportion of data values in each class (multiply by 100 to express it as a percentage).
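The three steps amount to one division per class. A short Python sketch, where the class counts are hypothetical and `relative_frequencies` is an illustrative name:

```python
def relative_frequencies(counts):
    """Divide each class frequency by the total number of observations."""
    total = sum(counts)              # Step 2: total across all classes
    if total == 0:
        raise ValueError("the frequency distribution is empty")
    return [c / total for c in counts]   # Step 3


# Hypothetical class frequencies, not from the slides.
counts = [2, 3, 3, 2]
rel = relative_frequencies(counts)
print(rel)
```

The relative frequencies always sum to 1, which is a quick check on the arithmetic.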

Cumulative Relative Frequency Distributions

1. List the number of observations in the lowest class.

2. Add the frequency of the lowest class to the frequency of the second class. Record that cumulative sum for the second class.

3. Continue to add the prior cumulative sum to the frequency for that class, so that the cumulative sum for the final class is the total number of observations in the data set.

Cumulative Relative Frequency Distributions

4. Divide the accumulated frequencies for each class by the total number of observations, giving you the proportion of all observations that occurred up to and including that class.

An Alternative: Accrue the relative frequencies for each class instead of the raw frequencies. Then you don’t have to divide by the total to get percentages.
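Both routes can be sketched in Python with `itertools.accumulate`. The function names and the class counts below are illustrative assumptions; the two functions should agree up to floating-point rounding.

```python
from itertools import accumulate


def cumulative_relative(counts):
    """Accumulate raw frequencies, then divide each running total by n."""
    total = sum(counts)
    return [c / total for c in accumulate(counts)]


def cumulative_relative_alt(counts):
    """The alternative: accumulate the relative frequencies directly."""
    total = sum(counts)
    return list(accumulate(c / total for c in counts))


# Hypothetical class frequencies, not from the slides.
counts = [2, 3, 3, 2]
cum = cumulative_relative(counts)
cum_alt = cumulative_relative_alt(counts)
print(cum)
print(cum_alt)
```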

Example

The average daily cost to community hospitals for patient stays during 1993 for each of the 50 U.S. states is given in the next table.
a) Arrange these into a data array.
b) Construct a stem-and-leaf display.
*) Approximately how many classes would be appropriate for these data?
c & d) Construct a frequency distribution. State interval width and class mark.
e) Construct a histogram, a relative frequency distribution, and a cumulative relative frequency distribution.

Example – Data List

AL  $775     HI    823     MA  1,036     NM  1,046     SD    506
AK  1,136    ID    659     MI    902     NY    784     TN    859
AZ  1,091    IL    917     MN    652     NC    763     TX  1,010
AR    678    IN    898     MS    555     ND    507     UT  1,081
CA  1,221    IA    612     MO    863     OH    940     VT    676
CO    961    KS    666     MT    482     OK    797     VA    830
CT  1,058    KY    703     NE    626     OR  1,052     WA  1,143
DE  1,024    LA    875     NV    900     PA    861     WV    701
FL    960    ME    738     NH    976     RI    885     WI    744
GA    775    MD    889     NJ    829     SC    838     WY    537

Example – Data Array

CA  1,221    TX  1,010     RI    885     NY    784     KS    666
WA  1,143    NH    976     LA    875     AL    775     ID    659
AK  1,136    CO    961     MO    863     GA    775     MN    652
AZ  1,091    FL    960     PA    861     NC    763     NE    626
UT  1,081    OH    940     TN    859     WI    744     IA    612
CT  1,058    IL    917     SC    838     ME    738     MS    555
OR  1,052    MI    902     VA    830     KY    703     WY    537
NM  1,046    NV    900     NJ    829     WV    701     ND    507
MA  1,036    IN    898     HI    823     AR    678     SD    506
DE  1,024    MD    889     OK    797     VT    676     MT    482

Example – Stem and Leaf Display

Stem-and-Leaf Display, N = 50
Stem unit: 100 (each leaf gives the remaining two digits)

Count  Stem  Leaves
   1    12   21
   2    11   43, 36
   8    10   91, 81, 58, 52, 46, 36, 24, 10
   7     9   76, 61, 60, 40, 17, 02, 00
 (11)    8   98, 89, 85, 75, 63, 61, 59, 38, 30, 29, 23
   9     7   97, 84, 75, 75, 63, 44, 38, 03, 01
   7     6   78, 76, 66, 59, 52, 26, 12
   4     5   55, 37, 07, 06
   1     4   82

Range: $482 - $1,221

Example – Frequency Distribution

To approximate the number of classes we should use in creating the frequency distribution, use Sturges’ Rule with n = 50:

\[ k = 1 + 3.322 \log_{10} n = 1 + 3.322 \log_{10} 50 = 1 + 3.322(1.69897) = 1 + 5.644 = 6.644 \approx 7 \]

Sturges’ rule suggests we use approximately 7 classes.

Example – Frequency Distribution

Step 1. Number of classes
Sturges’ Rule: approximately 7 classes.

Steps 2 & 3. The Class Interval
The range is: $1,221 – $482 = $739
$739/7 ≈ $106 and $739/8 ≈ $92
So, if we use 8 classes, we can make each class $100 wide.


Example – Frequency Distribution

Step 4. The Lower Class Limit
If we start at $450, we can cover the range in 8 classes, each class $100 in width.
The first class: $450 up to $550

Steps 5 & 6. Setting Class Limits
$450 up to $550        $850 up to $950
$550 up to $650        $950 up to $1,050
$650 up to $750        $1,050 up to $1,150
$750 up to $850        $1,150 up to $1,250

Example – Frequency Distribution

Average daily cost         Number    Mark
$450 – under $550             4      $500
$550 – under $650             3      $600
$650 – under $750             9      $700
$750 – under $850             9      $800
$850 – under $950            11      $900
$950 – under $1,050           7      $1,000
$1,050 – under $1,150         6      $1,100
$1,150 – under $1,250         1      $1,200

Interval width: $100
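The tally in this table can be reproduced from the data list with a few lines of Python. The variable names are illustrative; the 50 values are copied from the data list, in the order they appear there, and the classes are the eight $100-wide classes starting at $450.

```python
# Average daily costs (dollars) for the 50 states, as given in the data list.
costs = [
    775, 823, 1036, 1046, 506,
    1136, 659, 902, 784, 859,
    1091, 917, 652, 763, 1010,
    678, 898, 555, 507, 1081,
    1221, 612, 863, 940, 676,
    961, 666, 482, 797, 830,
    1058, 703, 626, 1052, 1143,
    1024, 875, 900, 861, 701,
    960, 738, 976, 885, 744,
    775, 889, 829, 838, 537,
]

width = 100
lower_limits = list(range(450, 1250, width))   # 450, 550, ..., 1150

# Each class runs from its lower limit up to, but not including, lower + width.
counts = [sum(1 for c in costs if lo <= c < lo + width) for lo in lower_limits]
marks = [lo + width // 2 for lo in lower_limits]   # class marks (midpoints)

for lo, count, mark in zip(lower_limits, counts, marks):
    print(f"${lo} - under ${lo + width}: {count} (mark ${mark})")
```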

Example – Histogram

[Figure: histogram of average daily cost. Horizontal axis: class marks 500 to 1200. Vertical axis: frequency, 0 to 12.]

Example – Relative Frequency Distribution

Average daily cost         Number    Rel. Freq.
$450 – under $550             4      4/50 = .08
$550 – under $650             3      3/50 = .06
$650 – under $750             9      9/50 = .18
$750 – under $850             9      9/50 = .18
$850 – under $950            11      11/50 = .22
$950 – under $1,050           7      7/50 = .14
$1,050 – under $1,150         6      6/50 = .12
$1,150 – under $1,250         1      1/50 = .02

Example – Polygon

[Figure: relative frequency polygon. Horizontal axis: 0 to 1400. Vertical axis: relative frequency, 0 to 0.25.]

Example – Cumulative Frequency Distribution

Average daily cost         Number    Cum. Freq.
$450 – under $550             4         4
$550 – under $650             3         7
$650 – under $750             9        16
$750 – under $850             9        25
$850 – under $950            11        36
$950 – under $1,050           7        43
$1,050 – under $1,150         6        49
$1,150 – under $1,250         1        50

Example – Cumulative Relative Frequency Distribution

Average daily cost         Cum. Freq.    Cum. Rel. Freq.
$450 – under $550              4         4/50 = .08
$550 – under $650              7         7/50 = .14
$650 – under $750             16         16/50 = .32
$750 – under $850             25         25/50 = .50
$850 – under $950             36         36/50 = .72
$950 – under $1,050           43         43/50 = .86
$1,050 – under $1,150         49         49/50 = .98
$1,150 – under $1,250         50         50/50 = 1.00

Example – Percentage Ogive

[Figure: ogive of the cumulative distribution. Horizontal axis: 0 to 1200. Vertical axis: 0 to 50.]

Key Terms

Measures of Central Tendency, The Center
• Mean (µ, population; x̄, sample)
• Weighted Mean
• Median
• Mode

Key Terms

Measures of Dispersion, The Spread
• Range
• Mean absolute deviation
• Variance
• Standard deviation
• Interquartile range
• Interquartile deviation
• Coefficient of variation

Key Terms

Measures of Relative Position
• Quantiles: Quartiles, Deciles, Percentiles
• Residuals
• Standardized values

The Mean

Mean
• Arithmetic average = (sum of all values)/(number of values)
• Population: \[ \mu = \frac{\sum x_i}{N} \]
• Sample: \[ \bar{x} = \frac{\sum x_i}{n} \]

Problem: Calculate the average number of truck shipments from the United States to five Canadian cities for the following data given in thousands of bags:

Montreal, 64.0; Ottawa, 15.0; Toronto, 285.0; Vancouver, 228.0; Winnipeg, 45.0

(Ans: 127.4)
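The answer to the problem can be checked with a short Python sketch. The dictionary name `shipments` is an illustrative choice; the figures are the five values given in the problem.

```python
# Truck shipments in thousands of bags, from the problem statement.
shipments = {
    "Montreal": 64.0,
    "Ottawa": 15.0,
    "Toronto": 285.0,
    "Vancouver": 228.0,
    "Winnipeg": 45.0,
}

# Arithmetic mean: sum of all values divided by the number of values.
mean = sum(shipments.values()) / len(shipments)
print(round(mean, 1))
```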

The Weighted Mean

When what you have is grouped data, compute the mean using

\[ \mu = \frac{\sum w_i x_i}{\sum w_i} \]

Problem: Calculate the average profit from truck shipments, United States to Canada, for the following data given in thousands of bags and profits per thousand bags:

City         Thousands of bags    Profit per thousand bags
Montreal           64.0                 $15.00
Ottawa             15.0                 $13.50
Toronto           285.0                 $15.50
Vancouver         228.0                 $12.00
Winnipeg           45.0                 $14.00

(Ans: $14.04 per thous. bags)
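A short Python sketch of the weighted mean for this problem, taking the shipment volumes as the weights and the profits per thousand bags as the values. The function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Montreal, Ottawa, Toronto, Vancouver, Winnipeg
profits = [15.00, 13.50, 15.50, 12.00, 14.00]   # dollars per thousand bags (x_i)
bags = [64.0, 15.0, 285.0, 228.0, 45.0]         # thousands of bags (w_i)

average_profit = weighted_mean(profits, bags)
print(round(average_profit, 2))
```

The unweighted average of the five profit figures would be $14.00; the weighted figure differs because the cities ship very different volumes.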

The Median

To find the median:

1. Put the data in an array.
2A. If the data set has an ODD number of values, the median is the middle value.
2B. If the data set has an EVEN number of values, the median is the AVERAGE of the middle two values. (Note that the median of an even set of data values is not necessarily a member of the set of values.)

The median is particularly useful if there are outliers in the data set, which otherwise tend to sway the value of an arithmetic mean.
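The procedure above can be sketched in Python as follows. The function name `median` and the small data sets are illustrative only; Python's standard `statistics.median` follows the same odd/even convention.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)          # Step 1: put the data in an array
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # Step 2A: odd number of values
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # Step 2B: even number of values


# Hypothetical data: one outlier moves the mean a long way but not the median.
data = [1, 2, 3, 4, 1000]
print(median(data))
print(sum(data) / len(data))
```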

The Mode

The mode is the most frequent value. While there is just one value for the mean and one value for the median, there may be more than one value for the mode of a data set.

The mode tends to be less frequently used than the mean or the median.

[Figure: histogram. Horizontal axis: class marks 500 to 1200. Vertical axis: frequency, 0 to 12.]
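Because a data set can have more than one mode, a Python sketch of the idea returns a list rather than a single value. The function name `modes` and the sample data are illustrative assumptions.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical data sets, not from the slides.
print(modes([5, 5, 1]))          # one mode
print(modes([1, 2, 2, 3, 3]))    # two values tie for most frequent
```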

Comparing Measures of Central Tendency

If mean = median = mode, the shape of the distribution is symmetric.

If mode < median < mean (equivalently, mean > median > mode), the shape of the distribution trails to the right and is positively skewed.

If mean < median < mode (equivalently, mode > median > mean), the shape of the distribution trails to the left and is negatively skewed.

The Range

The range is the distance between the smallest and the largest data value in the set.

Range = largest value – smallest value

Sometimes range is reported as an interval, anchored between the smallest and largest data value, rather than the actual width of that interval.

Residuals

Residuals are the differences between each data value in the set and the group mean:
for a population, xi – µ
for a sample, xi – x̄

The MAD

The mean absolute deviation is found by summing the absolute values of all residuals and dividing by the number of values in the set:

for a population,
\[ MAD = \frac{\sum |x_i - \mu|}{N} \]

for a sample,
\[ MAD = \frac{\sum |x_i - \bar{x}|}{n} \]
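A minimal Python sketch of the mean absolute deviation. The function name and the five data values are illustrative, chosen so the arithmetic is easy to follow by hand (mean 5, absolute residuals 3, 1, 1, 1, 4).

```python
def mean_absolute_deviation(values):
    """Average of the absolute residuals |x_i - mean|."""
    n = len(values)
    if n == 0:
        raise ValueError("MAD of an empty data set is undefined")
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


# Hypothetical data, not from the slides.
data = [2, 4, 4, 6, 9]
mad = mean_absolute_deviation(data)
print(mad)
```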

The Variance

Variance is one of the most frequently used measures of spread,

for population,
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{\sum x_i^2 - N\mu^2}{N} \]

for sample,
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{\sum x_i^2 - n\bar{x}^2}{n-1} \]

The right side of each equation is often used as a computational shortcut.
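The definitional and shortcut forms can be compared in a short Python sketch. The function names and the data are illustrative assumptions; with these five values the mean is 5 and the sum of squared residuals is 28.

```python
def population_variance(values):
    """Sum of squared residuals divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared residuals divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational shortcut: (sum of x_i^2 - n * xbar^2) / (n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return (sum(x * x for x in values) - n * xbar ** 2) / (n - 1)


# Hypothetical data, not from the slides.
data = [2, 4, 4, 6, 9]
print(population_variance(data))
print(sample_variance(data))
print(sample_variance_shortcut(data))
```

The shortcut is algebraically identical to the definition, but in floating-point arithmetic it can lose precision when the values are large relative to their spread, so the two results are compared with a tolerance rather than for exact equality.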

The Standard Deviation

Since variance is given in squared units, we often find uses for the standard deviation, which is the square root of variance:

for a population,
\[ \sigma = \sqrt{\sigma^2} \]

for a sample,
\[ s = \sqrt{s^2} \]

Quartiles

One of the most frequently used quantiles is the quartile.

Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations.

To find the first, second, and third quartiles:
1. Arrange the N data values into an array.
2. First quartile, Q1 = data value at position (N + 1)/4
3. Second quartile, Q2 = data value at position 2(N + 1)/4
4. Third quartile, Q3 = data value at position 3(N + 1)/4
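A Python sketch of the position rule follows. The slide does not say what to do when a position such as (N + 1)/4 is not a whole number; the sketch assumes linear interpolation between the two neighbouring values, which is one common convention and not necessarily the one intended here. The function names and data are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; fractional positions are interpolated (an assumption)."""
    n = len(ordered)
    position = min(max(position, 1), n)   # keep the position inside the array
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def quartiles(values):
    """Q1, Q2, Q3 at positions (N + 1)/4, 2(N + 1)/4, 3(N + 1)/4 of the ordered data."""
    ordered = sorted(values)              # Step 1: arrange the values into an array
    n = len(ordered)
    if n == 0:
        raise ValueError("quartiles of an empty data set are undefined")
    return tuple(value_at_position(ordered, q * (n + 1) / 4) for q in (1, 2, 3))


# Hypothetical data: with N = 7 the positions are exactly 2, 4 and 6.
print(quartiles([7, 1, 5, 3, 2, 6, 4]))
# With N = 4 the positions are 1.25, 2.5 and 3.75, so interpolation is used.
print(quartiles([10, 20, 30, 40]))
```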

Quartiles

[Figure: cumulative frequency (0 to 100) plotted against Ln_YarnS (0.0 to 6.0), with Q1, Q2 and Q3 marked.]

Standardized Values

How far above or below the individual value is compared to the population mean, in units of standard deviation:
• “How far above or below” = (data value – mean), which is the residual
• “In units of standard deviation” = divided by σ

Standardized individual value:

\[ z = \frac{x - \mu}{\sigma} \]

A negative z means the data value falls below the mean.
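A short Python sketch of standardization, treating the data as a whole population so that σ is computed with N in the denominator. The function name and the eight data values are illustrative; they were chosen so that the mean is 5 and σ is 2.

```python
import math


def standardize(values):
    """Return z = (x - mu) / sigma for each value, using the population sigma."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("all values are equal, so z is undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical data, not from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = standardize(data)
print(z)   # negative entries correspond to values below the mean
```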
