chapter-2 statistical description of quantitative variable

Post on 28-Dec-2015


Chapter 2 Statistical description of quantitative variables

Teaching contents

In this chapter, we shall study descriptive techniques for quantitative variables.

Section 1 Frequency distribution table and

frequency distribution graph

Section 2 Measures of central tendency

Section 3 Measures of dispersion tendency

Teaching aims

To learn how to use frequency tables and graphs.

To master the application of the different summary indexes.

Department of Health Statistics

Section 1 Frequency distribution

table and frequency distribution

graph

part 1 Frequency distribution table and

graph of qualitative variable

part 2 Frequency distribution table and

graph of quantitative variable

part 3 Usage of frequency distribution

graph


[Example 1.1] University officials periodically review the distribution of undergraduate majors to help determine a fair allocation of resources. The following data were obtained:

Table 1.1 The distribution of undergraduate majors

College              Number of majors
Agriculture          1500
Arts and sciences    3200
Education            1200
Engineering          4100

Fig 1.1 The distribution of undergraduate majors (bar chart; vertical axis: number of majors, 0 to 5000; bars: engineering, arts and sciences, agriculture, education)

[Example 1.2] The techniques will be illustrated using the Scottish Heart Health Study. For simplicity, we take only one variable, serum total cholesterol, recorded on 50 subjects.

5.75 6.29 6.13 6.78 6.46

6.76 5.98 6.25 6.31 5.99

6.47 5.71 5.19 4.35 5.35

7.11 6.89 6.05 7.01 5.86

5.42 4.92 7.12 5.85 5.64

7.04 6.23 5.71 6.74 6.36

5.75 7.71 6.19 7.55 6.76

7.14 5.73 6.73 7.86 5.51

6.02 6.54 5.34 6.92 7.15

6.55 7.16 4.79 6.64 6.83

Table 1.2 Serum total cholesterol (mmol/L) of 50 subjects from the Scottish Heart Health Study

How to describe the data in Table 1.2?

We could list all the data one by one, but then it is difficult for the reader to grasp the distribution characteristics of the 50 individuals.

Alternatively, we can summarize the data using specific indexes, which is economical in space and easier for the reader to understand.

FREQUENCY DISTRIBUTION TABLE and FREQUENCY DISTRIBUTION GRAPH

Step 1 Find MIN and MAX, and compute the range.

Step 2 Set up the class intervals.

Step 3 Assign every observation to one of the class intervals.

Step 1

MIN 4.35

MAX 7.86

RANGE 3.51

The range is the difference between MAX and MIN: 7.86 - 4.35 = 3.51.

Step 2

Divide the range by the desired number of class intervals to obtain the interval width.

Generally we wish to have 7 to 15 class intervals; the number depends on the sample size. The larger the sample size, the more class intervals are used.

Suppose we wish to have 7 class intervals. Then the interval width is 3.51 (range) / 7 ≈ 0.5, so we choose 0.5 as the interval width.

Note: The first class interval must contain MIN, and the last one must contain MAX. Starting at 4.0 with a width of 0.5 gives eight class intervals:

Cholesterol (mmol/L)
4.0-4.5
4.5-5.0
5.0-5.5
5.5-6.0
6.0-6.5
6.5-7.0
7.0-7.5
7.5-8.0

Step 3

Construct the frequency distribution by keeping a tally of the number of measurements falling in each interval.

Note: Each class interval includes its lower limit (L) but not its upper limit (U). For example, an observation of 5.5 belongs to the fourth group (5.5-6.0).

Table 1.3 Frequency distribution table for serum total cholesterol

Cholesterol (mmol/L)   Tally         Frequency   Percentage   Cumulative percentage
4.0-4.5                |             1           2%           2%
4.5-5.0                ||            2           4%           6%
5.0-5.5                ||||          4           8%           14%
5.5-6.0                |||||||||||   11          22%          36%
6.0-6.5                |||||||||||   11          22%          58%
6.5-7.0                |||||||||||   11          22%          80%
7.0-7.5                |||||||       7           14%          94%
7.5-8.0                |||           3           6%           100%
Total                                50          100%

Each class interval is written as lower limit-upper limit. Percentage is the frequency divided by the sample size (50).
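The three steps above can be sketched in Python. This is only an illustration: the function name `frequency_table` and its parameters are hypothetical, and the starting point 4.0, the width 0.5 and the eight classes are taken from Table 1.3.

```python
import math

# Serum total cholesterol (mmol/L) of the 50 subjects in Table 1.2.
cholesterol = [
    5.75, 6.29, 6.13, 6.78, 6.46,
    6.76, 5.98, 6.25, 6.31, 5.99,
    6.47, 5.71, 5.19, 4.35, 5.35,
    7.11, 6.89, 6.05, 7.01, 5.86,
    5.42, 4.92, 7.12, 5.85, 5.64,
    7.04, 6.23, 5.71, 6.74, 6.36,
    5.75, 7.71, 6.19, 7.55, 6.76,
    7.14, 5.73, 6.73, 7.86, 5.51,
    6.02, 6.54, 5.34, 6.92, 7.15,
    6.55, 7.16, 4.79, 6.64, 6.83,
]


def frequency_table(data, start, width, n_classes):
    """Tally data into intervals [lower, upper): lower limit included, upper excluded."""
    counts = [0] * n_classes
    for x in data:
        # Index of the class whose lower limit is the largest one not exceeding x.
        k = math.floor((x - start) / width)
        counts[k] += 1
    rows = []
    cumulative = 0
    for k, f in enumerate(counts):
        cumulative += f
        lower = start + k * width
        rows.append((lower, lower + width, f,
                     100 * f / len(data), 100 * cumulative / len(data)))
    return rows


# Step 1: MIN, MAX and range.
data_range = max(cholesterol) - min(cholesterol)
print(f"MIN {min(cholesterol)}  MAX {max(cholesterol)}  RANGE {data_range:.2f}")

# Steps 2 and 3: class intervals of width 0.5 starting at 4.0, then the tally.
table = frequency_table(cholesterol, start=4.0, width=0.5, n_classes=8)
for lower, upper, f, pct, cum_pct in table:
    print(f"{lower:.1f}-{upper:.1f}  {f:2d}  {pct:3.0f}%  {cum_pct:3.0f}%")
```

The sketch assumes that every observation lies between the first lower limit and the last upper limit, which is exactly the requirement stated in Step 2.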

Fig 1.2 Frequency distribution graph for serum total cholesterol (histogram; horizontal axis: serum cholesterol, class midpoints 4.25 to 7.75; vertical axis: frequency, with bar heights 1, 2, 4, 11, 11, 11, 7, 3; Mean = 6.29, Std. Dev = 0.76, N = 50)

The difference: the frequency distribution graph for serum cholesterol (Fig 1.2) compared with the bar chart of the number of majors (Fig 1.1).

Usage of frequency distribution graph

1 To describe the distribution characteristics of the frequencies.

From Table 1.3 and Fig 1.2, we can see that the serum total cholesterol of most people (37 of 50) lies between 5.0 and 7.0 mmol/L; the proportion outside this range is small.

How to describe the distribution characteristics of data?

Central tendency

Dispersion tendency


Describe how data are distributed

Positive-skewed, negative-skewed, symmetric

Table 2 Mercury concentration of hair in 238 healthy people

Mercury concentration (μg/g)   Number
<0.3                           3
0.3~                           17
0.7~                           66
1.1~                           60
1.5~                           48
1.9~                           18
2.3~                           16
2.7~                           6
3.1~                           1
3.5~                           1
3.9~                           2
Total                          238

Fig: Mercury concentration of hair (horizontal axis: hair mercury value, μg/g; vertical axis: number of people). Positive-skewed.

Table 3 Myoglobin concentration in blood serum of 101 normal people

Myoglobin concentration (μg/ml)   Number
0~                                2
5~                                3
10~                               7
15~                               9
20~                               10
25~                               22
30~                               23
35~                               14
40~                               9
45~50                             2
Total                             101

Fig: Myoglobin concentration in blood serum (horizontal axis: serum myoglobin, μg/ml; vertical axis: number of people). Negative-skewed.

2 To find outliers (values that are too large or too small) easily from the frequency distribution.

For instance, if all the serum total cholesterol values lie between 4.0 and 8.0 mmol/L and one value is 28 (implausibly large), we call it an outlier and should check whether it was recorded correctly.

3 It is a way of describing data.

Section 2 Measures of central tendency

1 Arithmetic mean

2 Geometric mean

3 Median and percentile

4 Mode

Central tendency reflects the average level of a series of measurements.

The arithmetic mean

[Definition] The arithmetic mean, also called the mean, is defined as the sum of the measurements divided by the total number of measurements.

[Symbols] The population mean is denoted by the Greek letter μ (read "mu") and the sample mean is denoted by the symbol $\bar{X}$ (read "X-bar").

[Sample mean]

$$\bar{X} = \frac{\sum X}{n}$$

n is the total number of observations in the sample.

X is a particular value.

Σ (read "sigma") indicates the operation of adding.

[Population mean]

$$\mu = \frac{\sum X}{N}$$

N is the total number of observations in the population.

[Example 2.1] The mean score on a given test can be found for an entire class. Take a look at this American History class:

[Solution] We find the mean score by adding all the scores together and dividing by 10 (the number of scores).

$$\bar{X} = \frac{\sum X}{n} = \frac{90 + 75 + \cdots + 85}{10} = 82.4$$

[Properties of the arithmetic mean]

All the values are included in computing the mean.

The mean is easily affected by extremely large or extremely small values.

The sum of the deviations of the values from the mean is zero:

$$\sum (X - \bar{X}) = 0$$
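These properties can be checked numerically. The sketch below uses a small hypothetical sample that does not come from the slides; it was chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical sample, chosen only for illustration (not from the slides).
sample = [2, 4, 6, 8, 10]


def mean(values):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(values) / len(values)


x_bar = mean(sample)
deviations = [x - x_bar for x in sample]
print("mean:", x_bar)
print("sum of deviations from the mean:", sum(deviations))

# Replacing the largest value with an extreme one pulls the mean upward.
with_extreme = sample[:-1] + [100]
print("mean with an extreme value:", mean(with_extreme))
```

Four of the five values are unchanged in the second sample, yet its mean is far larger than any of them, which is what "easily affected by extreme values" means in practice.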

[Notice]

The mean should only be used for homogeneous data. For example, we can compute the mean height of ten-year-old boys, but it is not meaningful to calculate the mean height of boys aged 1 to 14 years.

The mean should be used as the measure of central tendency only when the distribution is normal (symmetric). For such a distribution, the mean can be used.

Geometric mean

[Definition] The geometric mean is defined as the nth root of the product of the n numbers.

[Symbol] G

[Formula]

$$G = \sqrt[n]{X_1 X_2 \cdots X_n}$$

or

$$G = \lg^{-1}\left(\frac{\lg X_1 + \lg X_2 + \cdots + \lg X_n}{n}\right) = \lg^{-1}\left(\frac{\sum \lg X}{n}\right)$$

[Example 2.3] The serum antibody levels of six patients are listed below.

1:10, 1:20, 1:40, 1:80, 1:80, 1:160

Calculate the geometric mean.

[Solution]

$$G = \lg^{-1}\left(\frac{\sum \lg X}{n}\right) = \lg^{-1}\left(\frac{\lg 10 + \lg 20 + \cdots + \lg 160}{6}\right) = \lg^{-1}(1.6522) \approx 45$$

So the geometric mean is 1:45.

Here X is the reciprocal of the antibody level, lg X is the logarithm of the reciprocal, n is the sample size, and $\lg^{-1}$ denotes the inverse logarithm.
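The logarithmic form of the formula can be sketched in a few lines of Python. The function name `geometric_mean` is chosen here for illustration; the data are the reciprocals of the six antibody levels in Example 2.3.

```python
import math


def geometric_mean(values):
    """Inverse logarithm of the mean of the base-10 logarithms."""
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log


# Reciprocals of the antibody levels 1:10, 1:20, 1:40, 1:80, 1:80, 1:160.
reciprocals = [10, 20, 40, 80, 80, 160]

g = geometric_mean(reciprocals)
print(f"geometric mean antibody level is about 1:{g:.0f}")
```

Working with logarithms avoids forming the product of all the values, which can become very large when n is large.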

[Usage of G]

The geometric mean is often used for data that follow a geometric progression, such as 1:2, 1:4, 1:8, 1:16, 1:32.

Median

[Definition] The median, also called the 50th percentile, is the midpoint of the observations when they are arranged in ascending order.

[Formula]

When n is odd, the median is the middle value when the data are arranged in ascending order:

$$M = X_{\frac{n+1}{2}}$$

When n is even, the median is the mean of the middle two values when the data are arranged in ascending order:

$$M = \frac{1}{2}\left(X_{\frac{n}{2}} + X_{\frac{n}{2}+1}\right)$$

[Example 2.5] Each of 7 children in the second grade was given a reading aptitude test. The scores were as shown below.

95 86 64 81 75 76 69

Determine the median test score.

[Solution] First, we arrange the scores in ascending order:

64 69 75 76 81 86 95

There are 7 measurements, and the fourth is the middle value, so the median is 76. Using the formula:

$$M = X_{\frac{n+1}{2}} = X_4 = 76$$

[Example 2.6] An experiment was conducted to measure the effectiveness of a new procedure for pruning grapes. Ten workers were each assigned the task of pruning an acre of grapes. The productivity, measured in worker-hours/acre, was recorded for each person:

4.4 4.9 3.8 5.2 4.7 4.6 5.4 3.8 4.0 4.3

Determine the median productivity for the group.

[Solution] Arrange the data in ascending order:

3.8 3.8 4.0 4.3 4.4 4.6 4.7 4.9 5.2 5.4

Compute the mean of the 5th and 6th values:

$$M = \frac{1}{2}\left(X_{\frac{n}{2}} + X_{\frac{n}{2}+1}\right) = \frac{1}{2}(X_5 + X_6) = \frac{4.4 + 4.6}{2} = 4.5$$

[Exercise] Exercise capacity (in seconds) was determined for each of 11 patients being treated for chronic heart failure.

906 684 897 1320 1200 882

711 837 1008 1170 1056

Determine the median and the mean.

Answer

Mean 970

Median 906
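The odd and even cases of the median formula can be written as one small function. This is a minimal sketch with a hand-written `median` so that the two cases stay visible; Python's standard `statistics.median` gives the same results. The data are those of Example 2.5, Example 2.6 and the exercise above.

```python
import statistics


def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # X_{(n+1)/2}, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of X_{n/2} and X_{n/2+1}


reading_scores = [95, 86, 64, 81, 75, 76, 69]                      # Example 2.5
pruning_hours = [4.4, 4.9, 3.8, 5.2, 4.7, 4.6, 5.4, 3.8, 4.0, 4.3]  # Example 2.6
exercise_capacity = [906, 684, 897, 1320, 1200, 882,
                     711, 837, 1008, 1170, 1056]                   # exercise

print("Example 2.5 median:", median(reading_scores))
print("Example 2.6 median:", median(pruning_hours))
print("Exercise median:", median(exercise_capacity))
print("Exercise mean:", round(statistics.mean(exercise_capacity)))
```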

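When the sample size is very large, or when the data are grouped, we can use another formula to compute the median (P50).

Percentiles: P0 is the minimum and P100 is the maximum. The xth percentile Px divides the ordered data so that x% of the observations lie below it and (100 - x)% lie above it. The median M is P50.

$$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$$

$$P_{50} = L + \frac{i}{f_m}\left(n \cdot 50\% - \sum f_L\right)$$

$f_x$: frequency of the group including the percentile ($f_m$ for the group including the median)

i: interval width

L: lower limit of the group including the percentile (the median for P50)

$\sum f_L$: cumulative frequency below the group including the percentile (the median for P50)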

[Example 2.7] Determine the median in Example 1.2.

Cholesterol (mmol/L)   Frequency   Percentage   Cumulative frequency   Cumulative percentage
4.0-4.5                1           2%           1                      2%
4.5-5.0                2           4%           3                      6%
5.0-5.5                4           8%           7                      14%
5.5-6.0                11          22%          18                     36%
6.0-6.5                11          22%          29                     58%
6.5-7.0                11          22%          40                     80%
7.0-7.5                7           14%          47                     94%
7.5-8.0                3           6%           50                     100%
Total                  50          100%

Each class interval is written as lower limit-upper limit.

To determine which interval contains the median, we find the first interval for which the cumulative percentage reaches 50%. This interval will be the one containing the median.

For these data, the interval from 6.0 to 6.5 is the first interval for which the cumulative percentage reaches 50% (it reaches 58%), as shown in the last column of the table. So this interval contains the median. Then

L = 6.0, $f_m$ = 11, n = 50, i = 0.5, $\sum f_L$ = 18

$$P_{50} = L + \frac{i}{f_m}\left(n \cdot 50\% - \sum f_L\right) = 6.0 + \frac{0.5}{11}(25 - 18) = 6.32$$

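[Exercise] Calculate P25 and P75 in Example 1.2.

$$P_{25} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 5.5 + \frac{0.5}{11}(50 \times 25\% - 7) = 5.75$$

$$P_{75} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 6.5 + \frac{0.5}{11}(50 \times 75\% - 29) = 6.89$$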
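The grouped-data percentile formula can be sketched as a short function. The name `grouped_percentile` and its parameters are hypothetical; the lower limits and frequencies are those of the cholesterol frequency table, and all classes are assumed to have the same width.

```python
def grouped_percentile(lower_limits, freqs, width, pct):
    """P_x = L + i / f_x * (n * x% - sum of frequencies below the class)."""
    n = sum(freqs)
    target = n * pct / 100          # n * x%
    cumulative = 0                  # cumulative frequency below the current class
    for lower, f in zip(lower_limits, freqs):
        # The first class whose cumulative frequency reaches the target
        # is the class that contains the percentile.
        if f > 0 and cumulative + f >= target:
            return lower + width / f * (target - cumulative)
        cumulative += f
    raise ValueError("pct must be between 0 and 100")


lower_limits = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
freqs = [1, 2, 4, 11, 11, 11, 7, 3]

p25 = grouped_percentile(lower_limits, freqs, 0.5, 25)
p50 = grouped_percentile(lower_limits, freqs, 0.5, 50)
p75 = grouped_percentile(lower_limits, freqs, 0.5, 75)
print(f"P25 = {p25:.2f}, P50 = {p50:.2f}, P75 = {p75:.2f}")
```

The formula interpolates linearly inside the class, so it implicitly assumes that the observations are spread evenly between the lower and upper limits of that class.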


[Properties of the median]

It is not affected by extreme values.

It is the best index when there is no exact value at one or both ends of the distribution.

[Exercise] A doctor measured the incubation period (days) of an infectious disease in 10 patients. The outcomes are as follows:

6, 13, 5, 9, 12, 10, 8, 11, 8, >14

Calculate the average incubation period.

[Answer] There is no exact value at the right end of the distribution, so we should choose the median. First, we sort the data from the smallest to the largest:

5 6 8 8 9 10 11 12 13 >14

With n = 10, the median is the mean of the 5th and 6th values, 9 and 10, which is 9.5. So the average incubation period is 9.5 days.

[Usage of median]

• The median can be used for any type of quantitative variable: not only for normally distributed data, but also for skewed data or data in which some values are not known exactly.

• In symmetrical data, the mean theoretically equals the median.

Mode

[Definition] The mode of a set of measurements is defined as the measurement that occurs most often (with the highest frequency).

[Example 2.8] Find the mode of the English scores of 9 undergraduates.

76 87 69 76 85 80 79 81 83

The score 76 appears twice and every other score appears once, so the mode is 76.

The mode is the value that occurs most often. In some cases, there may be more than one mode.

[Example 2.9] Find the mode of the heights (m) of 10 boys.

1.45, 1.50, 1.32, 1.37, 1.45, 1.60, 1.48, 1.41, 1.35, 1.50

There are two modes in this example: 1.45 and 1.50.
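Both mode examples can be reproduced with the Python standard library. The sketch below uses `statistics.multimode` (available from Python 3.8), which returns every value that attains the highest frequency, so it covers the case of more than one mode.

```python
from statistics import multimode

english_scores = [76, 87, 69, 76, 85, 80, 79, 81, 83]          # Example 2.8
heights_m = [1.45, 1.50, 1.32, 1.37, 1.45, 1.60,
             1.48, 1.41, 1.35, 1.50]                            # Example 2.9

# multimode lists all the most frequent values, in order of first appearance.
print("Example 2.8 mode(s):", multimode(english_scores))
print("Example 2.9 mode(s):", multimode(heights_m))
```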

Summary

In a normal distribution, the mean, median, and mode are identical.

For normal distributions, the mean is the most efficient measure of central tendency and reflects the character of all the measurements.

Section 3 Measures of dispersion tendency

Central tendency reflects the average level of a quantitative variable. But knowing only the central tendency of a distribution is not enough; we should also describe the variation of the observations.

Group A: 3 4 5 6 7

Group B: 1 3 5 7 9

Mean of group A = (3+4+5+6+7)/5 = 5

Mean of group B = (1+3+5+7+9)/5 = 5

The means of the two groups are equal, but their dispersions are different.
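A short Python sketch makes the point concrete for groups A and B. It uses two of the measures of dispersion covered in this section, the range and the sample standard deviation; the helper name `describe` is chosen here only for illustration.

```python
import statistics

group_a = [3, 4, 5, 6, 7]
group_b = [1, 3, 5, 7, 9]


def describe(values):
    """Return (mean, range, sample standard deviation) of the values."""
    value_range = max(values) - min(values)
    return statistics.mean(values), value_range, statistics.stdev(values)


for name, group in (("A", group_a), ("B", group_b)):
    mean, value_range, sd = describe(group)
    print(f"Group {name}: mean = {mean}, range = {value_range}, SD = {sd:.2f}")
```

Both groups have the same mean, but the range and the standard deviation of group B are twice those of group A.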

1 Range

2 Interquartile range

3 Variance or standard deviation

4 Coefficient of variation

Dispersion tendency reflects the degree of variability among the measurements.

Range

[Definition] The range is the difference between MAX and MIN.

$$\text{Range} = \text{Value(max)} - \text{Value(min)}$$

[Example 3.1] Determine the range of the following data set.

1, 6, 2, 3, 9, 7, 5

[Solution 3.1] Range = 9 - 1 = 8.

Merit of range

It is the simplest measure of data variability.

Limitation of range

It is the least useful measure, because it reflects only the difference between MAX and MIN, and it is easily affected by extreme values.

Interquartile range

[Definition] The interquartile range is the distance between the third quartile Q3 (P75) and the first quartile Q1 (P25). This distance includes the middle 50 percent of the observations.

Interquartile range = Q3 - Q1

The quartiles Q1, Q2 and Q3 divide the ordered data, from the lower end L to the upper end U, into four parts that each contain 25% of the observations.

[Example 3.2] Calculate the IQR in Example 1.2 using the following table.

Cholesterol (mmol/L)   Frequency   Percentage   Cumulative frequency   Cumulative percentage
4.0-4.5                1           2%           1                      2%
4.5-5.0                2           4%           3                      6%
5.0-5.5                4           8%           7                      14%
5.5-6.0                11          22%          18                     36%
6.0-6.5                11          22%          29                     58%
6.5-7.0                11          22%          40                     80%
7.0-7.5                7           14%          47                     94%
7.5-8.0                3           6%           50                     100%
Total                  50          100%

Each class interval is written as lower limit-upper limit.

[Solution 3.2] First, we calculate P25 and P75:

$$P_{25} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 5.5 + \frac{0.5}{11}(50 \times 25\% - 7) = 5.75$$

$$P_{75} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 6.5 + \frac{0.5}{11}(50 \times 75\% - 29) = 6.89$$

IQR = 6.89 - 5.75 = 1.14

[Properties] The IQR (Q), although more sensitive to data pileup about the midpoint than the range, is still not sufficient for our purpose. It reflects only the variability of the middle 50% of the measurements, and it is of limited use in interpreting the variability of a single set of measurements.

Variance

[Definition] The population variance of a set of N measurements x1, x2, ... with arithmetic mean μ is the sum of the squared deviations divided by N.

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

[Definition] The sample variance of a set of n measurements x1, x2, ... with arithmetic mean $\bar{X}$ is the sum of the squared deviations divided by n - 1.

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Here $\bar{X}$ is the sample mean, $(X - \bar{X})^2$ is the squared deviation, and n - 1 is the degree of freedom.

[Example 3.3] The time between an electric light stimulus and a bar press to avoid a shock was noted for each of five conditioned rats. Use the data below to compute the sample variance.

Shock avoidance times (seconds): 5, 4, 3, 1, 3

[Solution 3.3] The sample mean is 3.2. The deviations and the squared deviations are shown below.

$X_i$      $X_i - \bar{X}$    $(X_i - \bar{X})^2$
5          1.8                3.24
4          0.8                0.64
3          -0.2               0.04
1          -2.2               4.84
3          -0.2               0.04
Total 16   0                  8.80

Using the total of the squared deviations column, we find the sample variance to be

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{8.8}{4} = 2.2$$

[Properties]

All values are used in the calculation.

Compared with the range, it is less dominated by extreme values.

The units of the variance are difficult to interpret, because they are the square of the original units.

Standard deviation

[Definition] The standard deviation is the positive square root of the variance.

[Symbol]

Population standard deviation: σ

Sample standard deviation: S

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

[Example 3.4] Calculate the sample standard deviation in Example 3.3.

[Solution 3.4]

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{8.8}{4}} = \sqrt{2.2} = 1.48$$
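The sample variance and standard deviation of the shock avoidance times can be recomputed step by step. This is a minimal sketch that follows the definition directly; the function name `sample_variance` is chosen for illustration, and the standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


times = [5, 4, 3, 1, 3]  # shock avoidance times (seconds), Example 3.3

s2 = sample_variance(times)
s = math.sqrt(s2)
print(f"sample variance = {s2:.2f}")
print(f"sample standard deviation = {s:.2f}")
```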

[Properties]

– It is the best measure for describing the variability of a quantitative variable, because it uses all the observations.

– It should be used only when the data come from a normal distribution.

Coefficient of variation

[Definition] The coefficient of variation is the ratio of the standard deviation to the arithmetic mean, expressed as a percentage:

$$CV = \frac{s}{\bar{X}} \times 100\%$$

[Usage]

To compare the variability of measurements with different units, such as height (cm) and weight (kg).

To compare the variability of groups whose means are quite different, one very small and the other very large, such as the weights of elephants and infants.

[Example 3.6] A doctor measured the heights and weights of 50 people. The outcome is

Height: $\bar{X}$ = 165 cm, S = 8.5 cm

Weight: $\bar{X}$ = 64 kg, S = 7 kg

Which has the larger variability, height or weight?

[Solution 3.6]

Height: CV = 8.5/165 × 100% = 5.15%

Weight: CV = 7/64 × 100% = 10.9%

So the variability of weight is much larger.
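The comparison can be reproduced with a one-line function. The name `coefficient_of_variation` is chosen here for illustration; the inputs are the means and standard deviations given in Example 3.6.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the arithmetic mean."""
    return sd / mean * 100


cv_height = coefficient_of_variation(sd=8.5, mean=165)  # height in cm
cv_weight = coefficient_of_variation(sd=7, mean=64)     # weight in kg

print(f"CV of height = {cv_height:.2f}%")
print(f"CV of weight = {cv_weight:.1f}%")
```

Because the units cancel in the ratio, the two percentages can be compared directly even though height is measured in centimetres and weight in kilograms.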

Description of data from a normal distribution: $\bar{X} \pm S$

Description of data from a skewed distribution: $M\ (P_{25},\ P_{75})$