doane chapter 04a

Upload: thomasmcarter

Post on 14-Apr-2018


TRANSCRIPT

  • 7/30/2019 Doane Chapter 04A


    Chapter 4

    Descriptive Statistics (Part 1)

    Numerical Description

    Central Tendency

    Dispersion

    Numerical Description

    Statistics are descriptive measures derived from a sample (n items).

    Parameters are descriptive measures derived from a population (N items).

    Numerical Description

    Three key characteristics of numerical data:

    Characteristic: Central Tendency
    Interpretation: Where are the data values concentrated? What seem to be typical or middle data values?

    Characteristic: Dispersion
    Interpretation: How much variation is there in the data? How spread out are the data values? Are there unusual values?

    Characteristic: Shape
    Interpretation: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

    Numerical Description
    Example: Vehicle Quality

    Consider the data set of vehicle defect rates from J. D. Power and Associates.

    Defect rate = (total no. defects / no. inspected) x 100

    Numerical statistics can be used to summarize this random sample of brands.

    Must allow for sampling error since the analysis is based on sampling.

    Numerical Description

    Number of defects per 100 vehicles, 1004 models.

    Numerical Description

    To begin, sort the data in Excel.

    Sorted data provides insight into central tendency and dispersion.

    Numerical Description
    Visual Displays

    The dot plot offers a visual impression of the data.

    Numerical Description
    Visual Displays

    Histograms with 5 bins (suggested by Sturges' Rule) and 10 bins are shown below.

    Both are symmetric with no extreme values and show a modal class toward the low end.

    Descriptive Statistics in Excel

    Go to Tools | Data Analysis and select Descriptive Statistics.

    Highlight the data range, specify a cell for the upper-left corner of the output range, check Summary Statistics and click OK.

    Here is the resulting analysis.

    Descriptive Statistics in MegaStat

    Central Tendency

    The central tendency is the middle or typical values of a distribution.

    Central tendency can be assessed using a dot plot, histogram or more precisely with numerical statistics.

    Central Tendency
    Six Measures of Central Tendency

    Statistic: Mean
    Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
    Excel Formula: =AVERAGE(Data)
    Pro: Familiar and uses all the sample information.
    Con: Influenced by extreme values.

    Statistic: Median
    Formula: Middle value in sorted array
    Excel Formula: =MEDIAN(Data)
    Pro: Robust when extreme data values exist.
    Con: Ignores extremes and can be affected by gaps in data values.

    Statistic: Mode
    Formula: Most frequently occurring data value
    Excel Formula: =MODE(Data)
    Pro: Useful for attribute data or discrete data with a small range.
    Con: May not be unique, and is not helpful for continuous data.

    Statistic: Midrange
    Formula: $\frac{x_{\min} + x_{\max}}{2}$
    Excel Formula: =0.5*(MIN(Data)+MAX(Data))
    Pro: Easy to understand and calculate.
    Con: Influenced by extreme values and ignores most data values.

    Statistic: Geometric mean (G)
    Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
    Excel Formula: =GEOMEAN(Data)
    Pro: Useful for growth rates and mitigates high extremes.
    Con: Less familiar and requires positive data.

    Statistic: Trimmed mean
    Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
    Excel Formula: =TRIMMEAN(Data, %)
    Pro: Mitigates effects of extreme values.
    Con: Excludes some data values that could be relevant.

    Central Tendency
    Mean

    A familiar measure of central tendency.

    In Excel, use function =AVERAGE(Data) where Data is an array of data values.

    Population Formula: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

    Sample Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

    Central Tendency
    Mean

    For the sample of n = 37 car brands:

    $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{87 + 93 + 98 + \cdots + 159 + 164 + 173}{37} = \frac{4639}{37} = 125.38$

    Central Tendency
    Characteristics of the Mean

    Arithmetic mean is the most familiar average.

    Affected by every sample item.

    The balancing point or fulcrum for the data.

    Central Tendency
    Characteristics of the Mean

    Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero.

    $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

    Consider the following asymmetric distribution of quiz scores whose mean = 65.

    $\sum_{i=1}^{n} (x_i - \bar{x})$ = (42 - 65) + (60 - 65) + (70 - 65) + (75 - 65) + (78 - 65)

    = (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0
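    The zero-sum property can be checked numerically. The following is a minimal sketch using the five quiz scores from the slide; the variable names are illustrative only.

```python
# Quiz scores from the slide; their mean is 65.
scores = [42, 60, 70, 75, 78]
mean = sum(scores) / len(scores)

# Signed deviations from the mean (negative below, positive above).
deviations = [x - mean for x in scores]

print(mean)             # 65.0
print(deviations)       # [-23.0, -5.0, 5.0, 10.0, 13.0]
print(sum(deviations))  # 0.0
```

    Note that the absolute deviations (23, 5, 5, 10, 13) do not sum to zero; only the signed deviations cancel.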

    Central Tendency
    Median

    The median (M) is the 50th percentile or midpoint of the sorted sample data.

    M separates the upper and lower half of the sorted observations.

    If n is odd, the median is the middle observation in the data array.

    If n is even, the median is the average of the middle two observations in the data array.

    Central Tendency
    Median

    For n = 8, the median is between the fourth and fifth observations in the data array.

    For n = 9, the median is the fifth observation in the data array.

    Central Tendency
    Median

    Consider the following n = 6 data values:

    11 12 15 17 21 32

    What is the median?

    For even n, Median = $\frac{x_{n/2} + x_{(n/2+1)}}{2}$

    n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4

    M = $(x_3 + x_4)/2$ = (15 + 17)/2 = 16

    The median, 16, falls between the third and fourth observations.

    Central Tendency
    Median

    Consider the following n = 7 data values:

    12 23 23 25 27 34 41

    What is the median?

    For odd n, Median = $x_{(n+1)/2}$

    (n+1)/2 = (7+1)/2 = 8/2 = 4

    M = $x_4$ = 25
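    The odd and even position rules can be sketched in a few lines. This is an illustrative helper (the function name `median` is chosen here, not taken from the slides) applied to the two example data sets above.

```python
def median(values):
    """Median of a non-empty list using the position rules on the slides."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        # Odd n: the observation at position (n + 1) / 2 (1-based).
        return data[(n + 1) // 2 - 1]
    # Even n: average the observations at positions n/2 and n/2 + 1.
    return (data[n // 2 - 1] + data[n // 2]) / 2

print(median([11, 12, 15, 17, 21, 32]))      # 16.0
print(median([12, 23, 23, 25, 27, 34, 41]))  # 25
```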

    Central Tendency
    Median

    Use Excel's function =MEDIAN(Data) where Data is an array of data values.

    For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19.

    So, the median is $x_{19}$ = 121.

    When there are several duplicate data values, the median does not provide a clean 50-50 split in the data.

    Central Tendency
    Characteristics of the Median

    The median is insensitive to extreme data values.

    For example, consider the following quiz scores for 3 students:

    Tom's scores: 20, 40, 70, 75, 80    Mean = 57, Median = 70, Total = 285

    Jake's scores: 60, 65, 70, 90, 95    Mean = 76, Median = 70, Total = 380

    Mary's scores: 50, 65, 70, 75, 90    Mean = 70, Median = 70, Total = 350

    What does the median for each student tell you?
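    The three students' figures can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the dictionary layout is an assumption made for illustration.

```python
import statistics

# Quiz scores for the three students on the slide.
quiz = {
    "Tom": [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
}

# Same median for all three, even though means and totals differ.
summary = {
    name: (statistics.mean(s), statistics.median(s), sum(s))
    for name, s in quiz.items()
}

for name, (mean, median, total) in summary.items():
    print(name, mean, median, total)
```

    All three medians are 70, so the median alone cannot distinguish Tom's low scores from Jake's high ones.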

    Central Tendency
    Mode

    The most frequently occurring data value.

    Similar to mean and median if data values occur often near the center of sorted data.

    May have multiple modes or no mode.

    Central Tendency
    Mode

    For example, consider the following quiz scores for 4 students:

    Lee's scores: 60, 70, 70, 70, 80    Mean = 70, Median = 70, Mode = 70

    Pat's scores: 45, 45, 70, 90, 100    Mean = 70, Median = 70, Mode = 45

    Sam's scores: 50, 60, 70, 80, 90    Mean = 70, Median = 70, Mode = none

    Xiao's scores: 50, 50, 70, 90, 90    Mean = 70, Median = 70, Modes = 50, 90

    What does the mode for each student tell you?
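    The "none", single-mode and two-mode cases can be sketched as follows. The helper `modes` is hypothetical, and treating a data set in which no value repeats as having no mode is a convention assumed here to match the slide.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == top)

print(modes([60, 70, 70, 70, 80]))   # [70]
print(modes([45, 45, 70, 90, 100]))  # [45]
print(modes([50, 60, 70, 80, 90]))   # []
print(modes([50, 50, 70, 90, 90]))   # [50, 90]
```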

    Central Tendency
    Mode

    Easy to define, not easy to calculate in large samples.

    Use Excel's function =MODE(Array)
    - will return #N/A if there is no mode.
    - will return first mode found if multimodal.

    May be far from the middle of the distribution and not at all typical.

    Central Tendency
    Mode

    Generally isn't useful for continuous data since data values rarely repeat.

    Best for attribute data or a discrete variable with a small range (e.g., Likert scale).

    Central Tendency
    Example: Price/Earnings Ratios and Mode

    Consider the following P/E ratios for a random sample of 68 Standard & Poor's 500 stocks.

    7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
    14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
    19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
    26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

    What is the mode?

    Central Tendency
    Example: Price/Earnings Ratios and Mode

    Excel's descriptive statistics results are:

    Mean       22.7206
    Median     19
    Mode       13
    Range      84
    Minimum    7
    Maximum    91
    Sum        1545
    Count      68

    The mode 13 occurs 7 times, but what does the dot plot show?

    Central Tendency
    Example: Price/Earnings Ratios and Mode

    The dot plot shows local modes (a peak with valleys on either side) at 10, 13, 15, 19, 23, 26, 29.

    These multiple modes suggest that the mode is not a stable measure of central tendency.

    Central Tendency
    Example: Rose Bowl Winners' Points

    Points scored by the winning NCAA football team tend to have modes in multiples of 7 because each touchdown yields 7 points.

    Consider the dot plot of the points scored by the winning team in the first 87 Rose Bowl games.

    What is the mode?

    Central Tendency
    Mode

    A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data.

    Occurs when dissimilar populations are combined in one sample. For example,

    Central Tendency
    Skewness

    Compare mean and median or look at histogram to determine degree of skewness.

    Central Tendency
    Symptoms of Skewness

    Distribution's Shape: Skewed left (negative skewness)
    Histogram Appearance: Long tail of histogram points left (a few low values but most data on right)
    Statistics: Mean < Median

    Distribution's Shape: Symmetric
    Histogram Appearance: Tails of histogram are balanced (low/high values offset)
    Statistics: Mean ≈ Median

    Distribution's Shape: Skewed right (positive skewness)
    Histogram Appearance: Long tail of histogram points right (most data on left but a few high values)
    Statistics: Mean > Median

    Central Tendency
    Skewness

    For the sample of J.D. Power quality ratings, the mean (125.38) exceeds the median (121). What does this suggest?

    Central Tendency
    Geometric Mean

    The geometric mean (G) is a multiplicative average.

    $G = \sqrt[n]{x_1 x_2 \cdots x_n}$

    For the J. D. Power quality data (n = 37):

    $G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$

    In Excel use =GEOMEAN(Array).

    The geometric mean tends to mitigate the effects of high outliers.

    Central Tendency
    Growth Rates

    A variation on the geometric mean used to find the average growth rate for a time series.

    $G = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$

    For example, from 1998 to 2002, Spirit Airlines revenues are:

    Year    Revenue (mil)
    1998    131
    1999    227
    2000    311
    2001    354
    2002    403

    The average growth rate is given by taking the geometric mean of the ratios of each year's revenue to the preceding year. With n = 5 years there are n - 1 = 4 such ratios.

    Due to cancellations, only the first and last years are relevant:

    $G = \sqrt[4]{\left(\frac{227}{131}\right)\left(\frac{311}{227}\right)\left(\frac{354}{311}\right)\left(\frac{403}{354}\right)} - 1 = \sqrt[4]{\frac{403}{131}} - 1 = 1.3244 - 1 = .3244$ or 32.4% per year

    In Excel use =(403/131)^(1/4)-1
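    The growth-rate calculation can be sketched as follows, using the five Spirit Airlines revenue figures from the slide. Five annual figures give four year-over-year ratios, so this sketch takes a fourth root; the variable names are illustrative.

```python
revenue = [131, 227, 311, 354, 403]  # Spirit Airlines revenue (mil), 1998 to 2002

# Year-over-year ratios: 227/131, 311/227, 354/311, 403/354.
ratios = [later / earlier for earlier, later in zip(revenue, revenue[1:])]

# Geometric mean of the ratios, minus 1.
product = 1.0
for r in ratios:
    product *= r
growth_from_ratios = product ** (1 / len(ratios)) - 1

# Shortcut: the intermediate years cancel, leaving last / first.
growth_shortcut = (revenue[-1] / revenue[0]) ** (1 / (len(revenue) - 1)) - 1

print(round(growth_from_ratios, 4))
print(round(growth_shortcut, 4))
```

    Both routes give the same value, roughly 0.324, because the intermediate revenues cancel in the product of ratios.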

    Central Tendency
    Midrange

    The midrange is the point halfway between the lowest and highest values of X.

    Easy to use but sensitive to extreme data values.

    Midrange = $\frac{x_{\min} + x_{\max}}{2}$

    For the J. D. Power quality data (n = 37):

    Midrange = $\frac{x_{\min} + x_{\max}}{2} = \frac{x_1 + x_{37}}{2} = \frac{87 + 173}{2} = 130$

    Here, the midrange (130) is higher than the mean (125.38) or median (121).

    Central Tendency
    Trimmed Mean

    To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.

    For example, for the n = 68 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).

    To determine how many observations to trim, multiply k x n = 0.05 x 68 = 3.4 or 3 observations.

    So, we would remove the three smallest and three largest observations before averaging the remaining values.

    Central Tendency
    Trimmed Mean

    Here is a summary of all the measures of central tendency for the n = 68 P/E values.

    Mean:              22.72    =AVERAGE(PERatio)
    Median:            19.00    =MEDIAN(PERatio)
    Mode:              13.00    =MODE(PERatio)
    Geometric Mean:    19.85    =GEOMEAN(PERatio)
    Midrange:          49.00    =(MIN(PERatio)+MAX(PERatio))/2
    5% Trim Mean:      21.10    =TRIMMEAN(PERatio,0.1)

    The trimmed mean mitigates the effects of very high values, but still exceeds the median.
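    The summary measures for the P/E data can be reproduced outside Excel. The following is a minimal sketch using Python's standard `statistics` module; the `trimmed_mean` helper is hypothetical and follows the slide's rule of dropping k x n observations (rounded down) from each end. Note that Excel's TRIMMEAN takes the total fraction trimmed from both ends combined, which is why the slide passes 0.1 for a 5 percent trim per tail.

```python
import statistics

# The 68 P/E ratios listed on the earlier slide.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

def trimmed_mean(values, k):
    """Mean after dropping the lowest and highest k fraction (count rounded down)."""
    data = sorted(values)
    cut = int(k * len(data))  # 0.05 * 68 = 3.4, so 3 from each end
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)

print(round(statistics.mean(pe), 2))      # 22.72
print(statistics.median(pe))              # 19.0
print(statistics.mode(pe))                # 13
print((min(pe) + max(pe)) / 2)            # 49.0
print(round(trimmed_mean(pe, 0.05), 2))   # 21.1
print(statistics.geometric_mean(pe))      # the slide reports 19.85
```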

    Central Tendency
    Trimmed Mean

    The Federal Reserve uses a 16% trimmed mean to mitigate the effects of extremes in its analysis of the Consumer Price Index.

    Dispersion
    Measures of Variation

    Variation is the spread of data points about the center of the distribution in a sample. Consider the following measures of dispersion:

    Statistic: Range
    Formula: $x_{\max} - x_{\min}$
    Excel: =MAX(Data)-MIN(Data)
    Pro: Easy to calculate.
    Con: Sensitive to extreme data values.

    Statistic: Variance ($s^2$)
    Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
    Excel: =VAR(Data)
    Pro: Plays a key role in mathematical statistics.
    Con: Non-intuitive meaning.

    Statistic: Standard deviation (s)
    Formula: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
    Excel: =STDEV(Data)
    Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
    Con: Non-intuitive meaning.

    Statistic: Coefficient of variation (CV)
    Formula: $CV = 100 \times \frac{s}{\bar{x}}$
    Excel: None
    Pro: Measures relative variation in percent so can compare data sets.
    Con: Requires non-negative data.

    Statistic: Mean absolute deviation (MAD)
    Formula: $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
    Excel: =AVEDEV(Data)
    Pro: Easy to understand.
    Con: Lacks nice theoretical properties.

    Dispersion
    Range

    The difference between the largest and smallest observation.

    Range = $x_{\max} - x_{\min}$

    For example, for the n = 68 P/E ratios,

    Range = 91 - 7 = 84

    Dispersion
    Variance

    The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean divided by the population size.

    $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

    For the sample variance ($s^2$), we divide by n - 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$.

    $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

    Dispersion
    Standard Deviation

    The square root of the variance.

    Units of measure are the same as X.

    Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

    Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

    Explains how individual values in a data set vary from the mean.

    Dispersion
    Standard Deviation

    Excel's built-in functions are:

    Statistic             Excel population formula    Excel sample formula
    Variance              =VARP(Array)                =VAR(Array)
    Standard deviation    =STDEVP(Array)              =STDEV(Array)
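    For readers working outside Excel, Python's standard `statistics` module offers the same population/sample split. This is a sketch only, applied to the five quiz scores (42, 60, 70, 75, 78) from the earlier mean example; the pairing with the Excel function names is given in the comments.

```python
import statistics

scores = [42, 60, 70, 75, 78]  # quiz scores with mean 65

# Population versions divide by N (Excel: =VARP, =STDEVP).
pop_var = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)

# Sample versions divide by n - 1 (Excel: =VAR, =STDEV).
samp_var = statistics.variance(scores)
samp_sd = statistics.stdev(scores)

print(pop_var, samp_var)
print(round(pop_sd, 2), round(samp_sd, 2))
```

    The sum of squared deviations is 848, so the population variance is 848/5 = 169.6 and the sample variance is 848/4 = 212; the sample figure is always the larger of the two.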

    Dispersion
    Calculating a Standard Deviation

    Consider the following five quiz scores for Stephanie.

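    Dispersion
    Calculating a Standard Deviation

    Now, calculate the sample standard deviation:

    $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{2380}{5 - 1}} = \sqrt{595} = 24.39$

    Somewhat easier, the two-sum formula can also be used:

    $s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}} = \sqrt{\frac{28300 - \frac{(360)^2}{5}}{5 - 1}} = \sqrt{\frac{28300 - 25920}{5 - 1}} = \sqrt{595} = 24.39$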
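    The two-sum formula needs only n, the sum of the scores and the sum of the squared scores. A minimal sketch using the three totals quoted on the slide (the individual scores are not reproduced here, so only the totals are used):

```python
import math

n = 5
sum_x = 360      # sum of the five scores, from the slide
sum_x2 = 28300   # sum of the squared scores, from the slide

# Two-sum formula: s = sqrt((sum of x^2 - (sum of x)^2 / n) / (n - 1))
numerator = sum_x2 - sum_x ** 2 / n   # 28300 - 25920
s = math.sqrt(numerator / (n - 1))    # sqrt(595)

print(numerator)    # 2380.0
print(round(s, 2))  # 24.39
```

    The numerator, 2380, equals the sum of squared deviations used in the definitional formula, which is why both routes give the same result.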

    Dispersion
    Calculating a Standard Deviation

    The standard deviation is nonnegative because deviations around the mean are squared.

    When every observation is exactly equal to the mean, the standard deviation is zero.

    Standard deviations can be large or small, depending on the units of measure.

    Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.

    Dispersion
    Coefficient of Variation

    Useful for comparing variables measured in different units or with different means.

    A unit-free measure of dispersion.

    Expressed as a percent of the mean.

    Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.

    $CV = 100 \times \frac{s}{\bar{x}}$

    Dispersion
    Coefficient of Variation

    $CV = 100 \times \frac{s}{\bar{x}}$

    For example:

    Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 x (22.89)/(125.38) = 18%

    ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 x (280.80)/(233.89) = 120%

    P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 x (14.08)/(22.72) = 62%
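    The three CV figures can be checked with a short helper. This is a sketch; the function name is illustrative, and raising an error for a non-positive mean is one possible way to reflect the slide's note that the CV is undefined in that case.

```python
def coefficient_of_variation(s, mean):
    """CV as a percent of the mean; undefined for a zero or negative mean."""
    if mean <= 0:
        raise ValueError("CV requires a positive mean")
    return 100 * s / mean

print(round(coefficient_of_variation(22.89, 125.38)))   # 18
print(round(coefficient_of_variation(280.80, 233.89)))  # 120
print(round(coefficient_of_variation(14.08, 22.72)))    # 62
```

    Although the ATM deposits and the defect rates are measured in different units, their CVs (120% versus 18%) can be compared directly.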

    Dispersion
    Mean Absolute Deviation

    The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).

    Uses absolute values of the deviations around the mean.

    $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

    Excel's function is =AVEDEV(Array)
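    As an illustration, the MAD can be computed for the five quiz scores (42, 60, 70, 75, 78) used earlier for the mean. The sketch below mirrors the formula directly; variable names are illustrative.

```python
scores = [42, 60, 70, 75, 78]  # quiz scores with mean 65
mean = sum(scores) / len(scores)

# Average of the absolute deviations: (23 + 5 + 5 + 10 + 13) / 5.
mad = sum(abs(x - mean) for x in scores) / len(scores)

print(mad)  # 11.2
```

    Unlike the signed deviations, which cancel to zero, the absolute deviations give a usable measure of spread.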

    Dispersion
    Central Tendency vs. Dispersion: Manufacturing

    Consider the histograms of hole diameters drilled in a steel plate during manufacturing.

    The desired distribution is outlined in red.

    Machine A: Desired mean (5 mm) but too much variation.

    Machine B: Acceptable variation but mean is less than 5 mm.

    Take frequent samples to monitor quality.

    Dispersion
    Central Tendency vs. Dispersion: Job Performance

    Consider student ratings of four professors on eight teaching attributes (10-point scale).

    Jones and Wu have identical means but different standard deviations.

    Smith and Gopal have different means but identical standard deviations.

    A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?

    Applied Statistics in Business and Economics

    End of Part 1 of Chapter 4