doane chapter 04b

Upload: thomasmcarter

Post on 14-Apr-2018

TRANSCRIPT

  • 7/30/2019 Doane Chapter 04B

    1/63

  • 7/30/2019 Doane Chapter 04B

    2/63

    Descriptive Statistics (Part 2)

    Standardized Data

    Percentiles and Quartiles

    Box Plots

    Grouped Data

    Skewness and Kurtosis (optional)

    Chapter

    4

  • 7/30/2019 Doane Chapter 04B

    3/63

    For any population with mean $\mu$ and standard deviation $\sigma$, the percentage of observations that lie within $k$ standard deviations of the mean must be at least $100[1 - 1/k^2]$.

    Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).

    Standardized Data

    Chebyshev's Theorem

  • 7/30/2019 Doane Chapter 04B

    4/63

    For $k = 2$ standard deviations, $100[1 - 1/2^2] = 75\%$.

    So, at least 75.0% will lie within $\mu \pm 2\sigma$.

    For $k = 3$ standard deviations, $100[1 - 1/3^2] = 88.9\%$.

    So, at least 88.9% will lie within $\mu \pm 3\sigma$.

    Although applicable to any data set, these limits tend to be too wide to be useful.

    Standardized Data

    Chebyshev's Theorem
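
    The bound above is simple enough to tabulate directly. The following is a minimal sketch; the function name `chebyshev_min_percent` is an illustrative choice, not something from the slides.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the formula gives 0% or less, so the bound says nothing useful.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100.0 * (1.0 - 1.0 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}% within k standard deviations")
```

    For k = 2 and k = 3 this gives the 75.0% and 88.9% quoted on the slide.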

  • 7/30/2019 Doane Chapter 04B

    5/63

    The Empirical Rule states that for data from a normal distribution, we expect that for

    k = 1, about 68.26% will lie within $\mu \pm 1\sigma$
    k = 2, about 95.44% will lie within $\mu \pm 2\sigma$
    k = 3, about 99.73% will lie within $\mu \pm 3\sigma$

    The normal or Gaussian distribution was named for Karl Gauss (1777-1855).

    The normal distribution is symmetric and is also known as the bell-shaped curve.

    Standardized Data

    The Empirical Rule

  • 7/30/2019 Doane Chapter 04B

    6/63

    Note: no upper bound is given.

    Data values outside $\mu \pm 3\sigma$ are rare.

    Distance from the mean is measured in terms of the number of standard deviations.

    Standardized Data

    The Empirical Rule

  • 7/30/2019 Doane Chapter 04B

    7/63

    If 80 students take an exam, how many will score

    within 2 standard deviations of the mean?

    Assuming exam scores follow a normal distribution, the empirical rule states that about 95.44% will lie within $\mu \pm 2\sigma$, so $0.9544 \times 80 \approx 76$ students will score within $\pm 2\sigma$ of $\mu$.

    How many students will score more than 2

    standard deviations from the mean?

    Standardized Data

    Example: Exam Scores
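
    The exam example can be checked numerically. This sketch assumes exactly normal scores and computes the coverage from the error function, so its percentages (about 68.27 and 95.45) differ in the last digit from the table-based 68.26 and 95.44 on the slides. The name `normal_coverage` is illustrative.

```python
import math


def normal_coverage(k: float) -> float:
    """P(|Z| <= k) for a standard normal Z, computed from the error function."""
    return math.erf(k / math.sqrt(2.0))


n_students = 80
inside = normal_coverage(2) * n_students    # expected count within 2 standard deviations
outside = n_students - inside               # expected count beyond 2 standard deviations

for k in (1, 2, 3):
    print(f"k = {k}: about {100 * normal_coverage(k):.2f}%")
print(f"within 2 sd: about {round(inside)} students, beyond: about {round(outside)}")
```

    Under these assumptions roughly 76 of the 80 students fall within two standard deviations and roughly 4 fall outside.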

  • 7/30/2019 Doane Chapter 04B

    8/63

    Unusual observations are those that lie beyond $\mu \pm 2\sigma$. Outliers are observations that lie beyond $\mu \pm 3\sigma$.

    Standardized Data

    Unusual Observations

  • 7/30/2019 Doane Chapter 04B

    9/63

    For example, the P/E ratio data contains several

    large data values. Are they unusual or outliers?

     7  8  8 10 10 10 10 12 13 13 13 13
    13 13 13 14 14 14 15 15 15 15 15 16
    16 16 17 18 18 18 18 19 19 19 19 19
    20 20 20 21 21 21 22 22 23 23 23 24
    25 26 26 26 26 27 29 29 30 31 34 36
    37 40 41 45 48 55 68 91

    Standardized Data

    Unusual Observations

  • 7/30/2019 Doane Chapter 04B

    10/63

    If the sample came from a normal distribution, then the Empirical Rule states

    $\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$

    $\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$

    $\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$

    Standardized Data

    The Empirical Rule

  • 7/30/2019 Doane Chapter 04B

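    11/63

    [Figure: number line for the P/E data centered at the mean 22.72, marking the $\pm 1s$, $\pm 2s$, and $\pm 3s$ limits (-5.4 and 50.9 at $\pm 2s$; -19.5 and 65.0 at $\pm 3s$). Values beyond $\pm 2s$ are labeled Unusual; values beyond $\pm 3s$ are labeled Outliers.]

    Standardized Data

    The Empirical Rule

    Are there any unusual values or outliers?

    7 8 . . . 48 55 68 91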
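
    One way to answer the question is to apply the 2s and 3s rules to the 68 P/E ratios directly. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# The 68 P/E ratios listed on the earlier slide.
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13,
    13, 13, 13, 14, 14, 14, 15, 15, 15, 15, 15, 16,
    16, 16, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19,
    20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24,
    25, 26, 26, 26, 26, 27, 29, 29, 30, 31, 34, 36,
    37, 40, 41, 45, 48, 55, 68, 91,
]

x_bar = statistics.mean(pe)
s = statistics.stdev(pe)  # sample standard deviation (n - 1 in the denominator)

# Values more than 2 (or 3) sample standard deviations from the mean.
beyond_2s = [x for x in pe if abs(x - x_bar) > 2 * s]
beyond_3s = [x for x in pe if abs(x - x_bar) > 3 * s]

print(f"mean = {x_bar:.2f}, s = {s:.2f}")
print("beyond 2s (unusual or worse):", beyond_2s)
print("beyond 3s (outliers):", beyond_3s)
```

    With these data, 55, 68, and 91 lie beyond 2s, and 68 and 91 also lie beyond 3s.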

  • 7/30/2019 Doane Chapter 04B

    12/63

    A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.

    Standardization formula for a population:

    \[ z_i = \frac{x_i - \mu}{\sigma} \]

    Standardization formula for a sample:

    \[ z_i = \frac{x_i - \bar{x}}{s} \]

    Standardized Data

    Defining a Standardized Variable

  • 7/30/2019 Doane Chapter 04B

    13/63

    $z_i$ tells how far away the observation is from the mean.

    For example, for the P/E data, the first value is $x_1 = 7$. The associated $z$ value is

    \[ z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12 \]

    Standardized Data

    Defining a Standardized Variable

  • 7/30/2019 Doane Chapter 04B

    14/63

    A negative $z$ value means the observation is below the mean.

    Positive $z$ means the observation is above the mean. For $x_{68} = 91$,

    \[ z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85 \]

    Standardized Data

    Defining a Standardized Variable

  • 7/30/2019 Doane Chapter 04B

    15/63

    Here are the standardized $z$ values for the P/E data:

    [Table: standardized $z$ values for the 68 P/E ratios.]

    Standardized Data

    Defining a Standardized Variable

    What do you conclude for these four values?

  • 7/30/2019 Doane Chapter 04B

    16/63

    In Excel, use =STANDARDIZE(Array, Mean, STDev) to calculate a standardized $z$ value.

    MegaStat calculates standardized values and also checks for outliers.

    Standardized Data

    Defining a Standardized Variable
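
    The same standardization can be sketched outside Excel. The helper below is a hypothetical stand-in for the spreadsheet function, and it uses the rounded mean (22.72) and standard deviation (14.08) quoted on the slides rather than recomputing them.

```python
def standardize(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


X_BAR, S = 22.72, 14.08  # rounded sample statistics for the P/E data, as quoted above

for x in (7, 91):
    print(f"x = {x}: z = {standardize(x, X_BAR, S):.2f}")
```

    The smallest P/E value standardizes to about -1.12 and the largest to about 4.85, matching the worked examples.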

  • 7/30/2019 Doane Chapter 04B

    17/63

    What do we do with outliers in a data set?

    If due to erroneous data, then discard.

    An outrageous observation (one completely outside of an expected range) is certainly invalid.

    Recognize unusual data points and outliers and their potential impact on your study.

    Research books and articles on how to handle outliers.

    Standardized Data

    Outliers

  • 7/30/2019 Doane Chapter 04B

    18/63

    For a normal distribution, the range of values is about $6\sigma$ (from $\mu - 3\sigma$ to $\mu + 3\sigma$).

    If you know the range $R$ (high - low), you can estimate the standard deviation as $\sigma = R/6$.

    Useful for approximating the standard deviation when only $R$ is known.

    This estimate depends on the assumption of normality.

    Standardized Data

    Estimating Sigma
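
    As a quick illustration of the R/6 rule, the sketch below applies it to the minimum and maximum of the P/E data (7 and 91). The function name is illustrative. The P/E data are not normal, so any agreement with the sample standard deviation of 14.08 should be treated as a rough check rather than a guarantee.

```python
def sigma_from_range(high: float, low: float) -> float:
    """Rough estimate of sigma as R/6, assuming approximately normal data."""
    return (high - low) / 6.0


print(sigma_from_range(91, 7))  # range R = 84
```

    Here the estimate is 14.0.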

  • 7/30/2019 Doane Chapter 04B

    19/63

    Percentiles are data that have been divided into 100 groups.

    For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.

    Deciles are data that have been divided into 10 groups.

    Quintiles are data that have been divided into 5 groups.

    Quartiles are data that have been divided into 4 groups.

    Percentiles and Quartiles

    Percentiles

  • 7/30/2019 Doane Chapter 04B

    20/63

    Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use the 5th, 25th, 50th, 75th, and 90th percentiles).

    Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.

    Percentiles are used in employee merit evaluation and salary benchmarking.

    Percentiles and Quartiles

    Percentiles

  • 7/30/2019 Doane Chapter 04B

    21/63

    Quartiles are scale points that divide the sorted

    data into four groups of approximately equal size.

    The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

    Q1 Q2 Q3

    Lower 25% | Second 25% | Third 25% | Upper 25%

    Percentiles and Quartiles

    Quartiles

  • 7/30/2019 Doane Chapter 04B

    22/63

    The second quartile Q2 is the median, an important indicator of central tendency.

    Q1 and Q3 measure dispersion since the interquartile range Q3 - Q1 measures the degree of spread in the middle 50 percent of data values.

    Q2

    Lower 50% | Upper 50%

    Q1 Q3

    Lower 25% | Middle 50% | Upper 25%

    Percentiles and Quartiles

    Quartiles

  • 7/30/2019 Doane Chapter 04B

    23/63

    The first quartile Q1 is the median of the data

    values below Q2, and the third quartile Q3 is the

    median of the data values above Q2.

    Q1 Q2 Q3

    Lower 25% | Second 25% | Third 25% | Upper 25%

    For first half of data,

    50% above,

    50% below Q1.

    For second half of data,

    50% above,

    50% below Q3.

    Percentiles and Quartiles

    Quartiles

  • 7/30/2019 Doane Chapter 04B

    24/63

    Depending on $n$, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

    Percentiles and Quartiles

    Quartiles

  • 7/30/2019 Doane Chapter 04B

    25/63

    For small data sets, find quartiles using the method of medians:

    Step 1. Sort the observations.

    Step 2. Find the median Q2.

    Step 3. Find the median of the data values that lie below Q2.

    Step 4. Find the median of the data values that lie above Q2.

    Percentiles and Quartiles

    Method of Medians
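
    The four steps can be sketched in a few lines. One assumption is made here that the slides do not spell out: the data are split by position, and when n is odd the median itself is left out of both halves. The function name is illustrative, and the two test data sets are the ones used on the Caution slide later in this chapter.

```python
import statistics


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(values)                        # Step 1: sort the observations
    n = len(xs)
    q2 = statistics.median(xs)                 # Step 2: median of all values
    half = n // 2
    lower = xs[:half]                          # values below the median position
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]  # values above it
    q1 = statistics.median(lower)              # Step 3: median of the lower half
    q3 = statistics.median(upper)              # Step 4: median of the upper half
    return q1, q2, q3


data_a = [1, 2, 4, 4, 8, 8, 8, 8]
data_b = [0, 3, 3, 6, 6, 6, 10, 15]
print(quartiles_by_medians(data_a))
print(quartiles_by_medians(data_b))
```

    Both data sets give Q1 = 3, Q2 = 6, and Q3 = 8.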

  • 7/30/2019 Doane Chapter 04B

    26/63

    Use Excel function =QUARTILE(Array, k) to return the kth quartile.

    Excel treats quartiles as a special case of percentiles. For example, to calculate Q3:

    =QUARTILE(Array, 3)

    =PERCENTILE(Array, 0.75)

    Excel calculates the quartile positions as:

    Position of Q1 = 0.25n + 0.75

    Position of Q2 = 0.50n + 0.50

    Position of Q3 = 0.75n + 0.25

    Percentiles and Quartiles

    Excel Quartiles

  • 7/30/2019 Doane Chapter 04B

    27/63

    Consider the following P/E ratios for 68 stocks in a

    portfolio.

    Use quartiles to define benchmarks for stocks that

    are low-priced (bottom quartile) or high-priced (top

    quartile).

     7  8  8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
    14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
    19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
    26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

    Percentiles and Quartiles

    Example: P/E Ratios and Quartiles

  • 7/30/2019 Doane Chapter 04B

    28/63

    Using Excel's method of interpolation, the quartile positions are:

    Quartile    Position Formula              Interpolate Between
    Q1          0.25(68) + 0.75 = 17.75       X17 and X18
    Q2          0.50(68) + 0.50 = 34.50       X34 and X35
    Q3          0.75(68) + 0.25 = 51.25       X51 and X52

    Percentiles and Quartiles

    Example: P/E Ratios and Quartiles

  • 7/30/2019 Doane Chapter 04B

    29/63

    The quartiles are:

    Quartile       Formula
    First (Q1)     Q1 = X17 + 0.75 (X18 - X17) = 14 + 0.75 (14 - 14) = 14
    Second (Q2)    Q2 = X34 + 0.50 (X35 - X34) = 19 + 0.50 (19 - 19) = 19
    Third (Q3)     Q3 = X51 + 0.25 (X52 - X51) = 26 + 0.25 (26 - 26) = 26

    Percentiles and Quartiles

    Example: P/E Ratios and Quartiles
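
    The position-and-interpolate calculation above can be sketched as follows. This is a reading of the position formulas on the slide (position = pn + (1 - p) for p = 0.25, 0.50, 0.75), not Excel's own code, and the function name is illustrative.

```python
def quartile_by_position(sorted_values, p):
    """Interpolated quantile at the 1-based position p*n + (1 - p)."""
    n = len(sorted_values)
    position = p * n + (1.0 - p)         # e.g. 0.25n + 0.75 for Q1
    lower_index = int(position)          # 1-based index of the lower neighbor
    fraction = position - lower_index    # how far to move toward the upper neighbor
    lower = sorted_values[lower_index - 1]
    upper = sorted_values[min(lower_index, n - 1)]
    return lower + fraction * (upper - lower)


# The 68 sorted P/E ratios from the slide.
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

q1, q2, q3 = (quartile_by_position(pe, p) for p in (0.25, 0.50, 0.75))
print(q1, q2, q3)
```

    For the P/E data this reproduces Q1 = 14, Q2 = 19, and Q3 = 26.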

  • 7/30/2019 Doane Chapter 04B

    30/63

    So, to summarize:

    Lower 25% of P/E ratios | Q1 = 14 | Second 25% of P/E ratios | Q2 = 19 | Third 25% of P/E ratios | Q3 = 26 | Upper 25% of P/E ratios

    These quartiles express central tendency and dispersion. What is the interquartile range?

    Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

    Percentiles and Quartiles

    Example: P/E Ratios and Quartiles

  • 7/30/2019 Doane Chapter 04B

    31/63

    Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

    Percentiles and Quartiles

    Tip

  • 7/30/2019 Doane Chapter 04B

    32/63

    Quartiles generally resist outliers.

    However, quartiles do not provide clean cut points

    in the sorted data, especially in small samples with

    repeating data values.

    Data set A: 1, 2, 4, 4, 8, 8, 8, 8          Q1 = 3, Q2 = 6, Q3 = 8

    Data set B: 0, 3, 3, 6, 6, 6, 10, 15        Q1 = 3, Q2 = 6, Q3 = 8

    Although they have identical quartiles, these two

    data sets are not similar. The quartiles do not

    represent either data set well.

    Percentiles and Quartiles

    Caution

  • 7/30/2019 Doane Chapter 04B

    33/63

    Some robust measures of central tendency and dispersion using quartiles are:

    Statistic: Midhinge
    Formula: $(Q_1 + Q_3)/2$
    Excel: =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3))
    Pro: Robust to presence of extreme data values.
    Con: Less familiar to most people.

    Percentiles and Quartiles

    Dispersion Using Quartiles

  • 7/30/2019 Doane Chapter 04B

    34/63

    Statistic: Midspread
    Formula: $Q_3 - Q_1$
    Excel: =QUARTILE(Data,3)-QUARTILE(Data,1)
    Pro: Stable when extreme data values exist.
    Con: Ignores magnitude of extreme data values.

    Statistic: Coefficient of quartile variation (CQV)
    Formula: $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
    Excel: None
    Pro: Relative variation in percent so we can compare data sets.
    Con: Less familiar to non-statisticians.

    Percentiles and Quartiles

    Dispersion Using Quartiles

  • 7/30/2019 Doane Chapter 04B

    35/63

    The mean of the first and third quartiles.

    \[ \text{Midhinge} = \frac{Q_1 + Q_3}{2} \]

    For the 68 P/E ratios,

    \[ \text{Midhinge} = \frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20 \]

    A robust measure of central tendency since quartiles ignore extreme values.

    Percentiles and Quartiles

    Midhinge

  • 7/30/2019 Doane Chapter 04B

    36/63

    A robust measure of dispersion.

    Midspread = $Q_3 - Q_1$

    For the 68 P/E ratios,

    Midspread = $Q_3 - Q_1 = 26 - 14 = 12$

    Percentiles and Quartiles

    Midspread (Interquartile Range)

  • 7/30/2019 Doane Chapter 04B

    37/63

    Measures relative dispersion, expressing the midspread as a percent of $Q_3 + Q_1$ (twice the midhinge).

    \[ \text{CQV} = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} \]

    For the 68 P/E ratios,

    \[ \text{CQV} = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\% \]

    Similar to the CV, CQV can be used to compare data sets measured in different units or with different means.

    Percentiles and Quartiles

    Coefficient of Quartile Variation (CQV)
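
    The three quartile-based statistics are easy to compute together. The sketch below uses Q1 = 14 and Q3 = 26 from the P/E example; the function names are illustrative.

```python
def midhinge(q1: float, q3: float) -> float:
    """Average of the first and third quartiles."""
    return (q1 + q3) / 2.0


def midspread(q1: float, q3: float) -> float:
    """Interquartile range, Q3 - Q1."""
    return q3 - q1


def cqv(q1: float, q3: float) -> float:
    """Coefficient of quartile variation, in percent."""
    return 100.0 * (q3 - q1) / (q3 + q1)


q1, q3 = 14, 26  # quartiles of the 68 P/E ratios
print(midhinge(q1, q3), midspread(q1, q3), cqv(q1, q3))
```

    For the P/E data the midhinge is 20, the midspread is 12, and the CQV is 30.0%.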

  • 7/30/2019 Doane Chapter 04B

    38/63

    A useful tool of exploratory data analysis (EDA).

    Also called a box-and-whisker plot.

    Based on a five-number summary:

    Xmin, Q1, Q2, Q3, Xmax

    Consider the five-number summary for the 68 P/E ratios:

    Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91

    Box Plots

  • 7/30/2019 Doane Chapter 04B

    39/63

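    [Figure: box plot of the 68 P/E ratios with callouts for the minimum, Q1, median (Q2), Q3, maximum, the box, and the whiskers. The distribution is right-skewed, and the center of the box is the midhinge.]

    Box Plots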

  • 7/30/2019 Doane Chapter 04B

    40/63

    Use quartiles to detect unusual data points.

    The cutoff points are called fences and can be found using the following formulas:

                   Inner fences            Outer fences
    Lower fence    Q1 - 1.5 (Q3 - Q1)      Q1 - 3.0 (Q3 - Q1)
    Upper fence    Q3 + 1.5 (Q3 - Q1)      Q3 + 3.0 (Q3 - Q1)

    Values outside the inner fences are unusual while those outside the outer fences are outliers.

    Box Plots

    Fences and Unusual Data Values

  • 7/30/2019 Doane Chapter 04B

    41/63

    For example, consider the P/E ratio data:

                   Inner fences                  Outer fences
    Lower fence    14 - 1.5 (26 - 14) = -4       14 - 3.0 (26 - 14) = -22
    Upper fence    26 + 1.5 (26 - 14) = +44      26 + 3.0 (26 - 14) = +62

    Ignore the lower fence since it is negative and P/E ratios are only positive.

    Box Plots

    Fences and Unusual Data Values

  • 7/30/2019 Doane Chapter 04B

    42/63

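    Truncate the whisker at the fences and display unusual values and outliers as dots.

    [Figure: box plot of the P/E ratios with the upper inner fence and outer fence marked. Points between the fences are labeled Unusual and points beyond the outer fence are labeled Outliers.]

    Based on these fences, there are three unusual P/E values and two outliers.

    Box Plots

    Fences and Unusual Data Values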
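
    The fence calculation and the resulting counts can be reproduced with a short sketch. It takes Q1 = 14 and Q3 = 26 as given from the earlier slides; the helper `outside` is an illustrative name.

```python
# The 68 P/E ratios from the earlier slides.
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

q1, q3 = 14, 26           # quartiles found earlier
iqr = q3 - q1             # midspread

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (lower, upper) inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # (lower, upper) outer fences


def outside(x, fence):
    """True if x lies below the lower fence or above the upper fence."""
    low, high = fence
    return x < low or x > high


outliers = [x for x in pe if outside(x, outer)]
unusual = [x for x in pe if outside(x, inner) and not outside(x, outer)]

print("inner fences:", inner, "outer fences:", outer)
print("unusual:", unusual)
print("outliers:", outliers)
```

    The unusual values are 45, 48, and 55, and the outliers are 68 and 91, which agrees with the counts stated above.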

  • 7/30/2019 Doane Chapter 04B

    43/63

    Although some information is lost, grouped data are easier to display than raw data.

    When bin limits are given, the mean and standard deviation can be estimated.

    Accuracy of grouped estimates depends on

    - the number of bins
    - distribution of data within bins
    - bin frequencies

    Grouped Data

    Nature of Grouped Data

  • 7/30/2019 Doane Chapter 04B

    44/63

    Consider the frequency distribution for prices of Lipitor for three cities:

    [Table: frequency distribution of Lipitor prices.]

    where

    $m_j$ = class midpoint, $f_j$ = class frequency,

    $k$ = number of classes, $n$ = sample size

    Grouped Data

    Mean and Standard Deviation

  • 7/30/2019 Doane Chapter 04B

    45/63

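    Estimate the mean and standard deviation by

    \[ \bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3427.5}{47} = 72.92552 \]

    \[ s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293 \]

    Note: don't round off too soon.

    Grouped Data

    Nature of Grouped Data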
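
    The grouped formulas can be sketched as a small function. The Lipitor frequency table is not reproduced in this transcript, so the bins below (0-10, 10-20, 20-30 with frequencies 1, 2, 1) are hypothetical and serve only to show the mechanics. The function name is also illustrative.

```python
import math


def grouped_mean_sd(midpoints, frequencies):
    """Estimate the mean and sample standard deviation from class midpoints and frequencies."""
    n = sum(frequencies)
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    sum_sq = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    sd = math.sqrt(sum_sq / (n - 1))
    return mean, sd


# Hypothetical bins, not the Lipitor data.
midpoints = [5, 15, 25]
frequencies = [1, 2, 1]

mean, sd = grouped_mean_sd(midpoints, frequencies)
cv = 100 * sd / mean  # coefficient of variation, in percent
print(f"mean = {mean:.5f}, s = {sd:.5f}, CV = {cv:.1f}%")
```

    Full precision is kept until the final print, in line with the note about not rounding too soon.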

  • 7/30/2019 Doane Chapter 04B

    46/63

    Now estimate the coefficient of variation:

    \[ CV = 100 \, (s / \bar{x}) = 100 \, (6.74293 / 72.92552) = 9.2\% \]

    Grouped Data

    Nature of Grouped Data

    How accurate are grouped estimates compared to ungrouped estimates?

    For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.

    Grouped Data

    Accuracy Issues

  • 7/30/2019 Doane Chapter 04B

    47/63

    For this example, very little information was lost

    due to grouping.

    However, accuracy could be lost due to the nature of the grouping (i.e., if the groups were not evenly spaced within bins).

    Grouped Data

    Accuracy Issues

  • 7/30/2019 Doane Chapter 04B

    48/63

    The dot plot shows a relatively even distribution

    within the bins.

    Effects of uneven distributions within bins tend to

    average out unless there is systematic skewness.

    Grouped Data

    Accuracy Issues


  • 7/30/2019 Doane Chapter 04B

    49/63

    Accuracy tends to improve as the number of bins

    increases.

    If the first or last class is open-ended, there will be

    no class midpoint (no mean can be estimated).

    Assume a lower limit of zero for the first class

    when the data are nonnegative.

    You may be able to assume an upper limit for some variables (e.g., age).

    Median and quartiles may be estimated even with open-ended classes.

    Grouped Data

    Accuracy Issues

  • 7/30/2019 Doane Chapter 04B

    50/63

    Generally, skewness may be indicated by looking

    at the sample histogram or by comparing the mean

    and median.

    This visual indicator is imprecise and does not take

    into consideration sample size n.

    Skewness and Kurtosis

    Skewness


  • 7/30/2019 Doane Chapter 04B

    51/63

    Skewness is a unit-free statistic.

    The coefficient compares two samples measured in different units or one sample with a known reference distribution (e.g., symmetric normal distribution).

    Calculate the sample's skewness coefficient as:

    \[ \text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 \]

    Skewness and Kurtosis

    Skewness
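
    The skewness coefficient can be computed directly from its formula. This is a minimal sketch; the function name and the two small test samples are illustrative and not from the slides.

```python
import statistics


def sample_skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) times the sum of cubed standardized deviations."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    total = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total


symmetric = [1, 2, 3, 4, 5]       # illustrative symmetric sample
right_skewed = [1, 2, 3, 4, 10]   # illustrative sample with a long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

    A symmetric sample gives a coefficient of essentially zero, while a long right tail gives a positive coefficient.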

  • 7/30/2019 Doane Chapter 04B

    52/63

    In Excel, go to Tools | Data Analysis | Descriptive Statistics or use the function =SKEW(array).

    Skewness and Kurtosis

    Skewness

  • 7/30/2019 Doane Chapter 04B

    53/63

    Consider the following table showing the 90%

    range for the sample skewness coefficient.

    Skewness and Kurtosis

    Skewness


  • 7/30/2019 Doane Chapter 04B

    54/63

    Coefficients within the 90% range may be

    attributed to random variation.

    Skewness and Kurtosis

    Skewness


  • 7/30/2019 Doane Chapter 04B

    55/63

    Coefficients outside the range suggest the sample

    came from a nonnormal population.

    Skewness and Kurtosis

    Skewness


  • 7/30/2019 Doane Chapter 04B

    56/63

    As n increases, the range of chance variation

    narrows.

    Skewness and Kurtosis

    Skewness


  • 7/30/2019 Doane Chapter 04B

    57/63

    Kurtosis is the relative length of the tails and the

    degree of concentration in the center.

    Consider three kurtosis prototype shapes.

    Skewness and Kurtosis

    Kurtosis

  • 7/30/2019 Doane Chapter 04B

    58/63

    A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.

    Excel and MINITAB calculate kurtosis as:

    \[ \text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} \]

    Skewness and Kurtosis

    Kurtosis
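
    The kurtosis formula can be sketched the same way. The subtracted term makes this an excess kurtosis, so values near zero are what a normal sample would tend to produce. The function name and the small test sample are illustrative.

```python
import statistics


def sample_kurtosis(values):
    """Sample (excess) kurtosis following the formula on the slide."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    total = sum(((x - x_bar) / s) ** 4 for x in values)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * total
    second = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - second


flat = [1, 2, 3, 4, 5]  # illustrative evenly spaced sample
print(sample_kurtosis(flat))
```

    For this evenly spaced sample the coefficient works out to -1.2, which indicates a flatter shape than the normal reference.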

  • 7/30/2019 Doane Chapter 04B

    59/63

    Consider the following table of expected 90%

    range for sample kurtosis coefficient.

    Skewness and Kurtosis

    Kurtosis

  • 7/30/2019 Doane Chapter 04B

    60/63

    A sample coefficient within the ranges may be

    attributed to chance variation.

    Skewness and Kurtosis

    Kurtosis

  • 7/30/2019 Doane Chapter 04B

    61/63

    Coefficients outside the range would suggest the

    sample differs from a normal population.

    Skewness and Kurtosis

    Kurtosis

  • 7/30/2019 Doane Chapter 04B

    62/63

    As sample size increases, the chance range

    narrows.

    Inferences about kurtosis are risky for $n < 50$.

    Kurtosis

  • 7/30/2019 Doane Chapter 04B

    63/63

    Applied Statistics in Business and Economics

    End of Chapter 4