top 10 concepts statistics

Upload: neeebbbsy89

Post on 28-Mar-2016


DESCRIPTION

Basic Statistics Reviewer

TRANSCRIPT

  • Review of Top 10 Concepts

    in Statistics

    NOTE: This Power Point file is not an introduction,

    but rather a checklist of topics to review

  • Top Ten #1

    Descriptive Statistics

  • Measures of Central Location

    Mean

    Median

    Mode

  • Mean

    Population mean = μ = Σx/N = (5+1+6)/3 = 12/3 = 4

    Algebra: Σx = N·μ = 3·4 = 12

    Sample mean = x-bar = Σx/n

    Example: the number of hours spent on the Internet: 4, 8, and 9

    x-bar = (4+8+9)/3 = 7 hours

    Do NOT use the mean if the number of observations is small or the data include extreme values

    Ex: Do NOT use the mean if 3 houses were sold this week, and one was a mansion

  • Median

    Median = middle value

    Example: 5,1,6

    Step 1: Sort data: 1,5,6

    Step 2: Middle value = 5

    When there is an even number of observations,

    median is computed by averaging the two

    observations in the middle.

    OK even if there are extreme values

    Home sales: 100K,200K,900K, so

    mean =400K, but median = 200K

  • Mode

    Mode: most frequent value

    Ex: female, male, female

    Mode = female

    Ex: 1,1,2,3,5,8

    Mode = 1

    It may not be a very good measure, see the

    following example

  • Measures of Central Location -

    Example

    Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23

    Sample Mean = x-bar = Σx/n = 100/10 = 10

    Median = (8+9)/2 = 8.5

    Mode = 0
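
The three measures for this sample can be checked with a short Python sketch. It uses only the standard `statistics` module; the variable names are illustrative and not part of the slides. The home sales figures from the Median slide are included to show how one extreme value pulls the mean but not the median.

```python
from statistics import mean, median, mode

# Sample from the slide
sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

sample_mean = mean(sample)      # sum of the values divided by n
sample_median = median(sample)  # n is even, so the two middle values are averaged
sample_mode = mode(sample)      # most frequent value

# Home sales (in thousands) from the Median slide
home_sales = [100, 200, 900]
sales_mean = mean(home_sales)
sales_median = median(home_sales)

print(sample_mean, sample_median, sample_mode)
print(sales_mean, sales_median)
```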

  • Relationship

    Case 1: if probability distribution symmetric

    (ex. bell-shaped, normal distribution),

    Mean = Median = Mode

    Case 2: if distribution is positively skewed (to the

    right) (ex. incomes of employees in a large firm: a

    large number of relatively low-paid workers

    and a small number of high-paid executives),

    Mode < Median < Mean

  • Relationship contd

    Case 3: if distribution is negatively skewed (to the left) (ex. the time taken by students to write exams: few students hand in their exams early and the majority turn in their exams at the end of the exam), Mean < Median < Mode

  • Dispersion Measures of Variability

    How much spread of data

    How much uncertainty

    Measures

    Range

    Variance

    Standard deviation

  • Range

    Range = Max − Min ≥ 0

    But range affected by unusual values

    Ex: Santa Monica has a high of 105 degrees

    and a low of 30 once a century, but range

    would be 105-30 = 75

  • Standard Deviation (SD)

    Better than range because all data used

    Population SD = Square root of variance

    = sigma = σ = \sqrt{\frac{\sum (x - \mu)^2}{N}}

    SD ≥ 0

  • Empirical Rule

    Applies to mound or bell-shaped curves

    Ex: normal distribution

    68% of data within ± one SD of mean

    95% of data within ± two SD of mean

    99.7% of data within ± three SD of mean
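
As a quick check of these percentages for the normal case, the sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+). The helper name `share_within` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

The rule is exact only for a normal curve; for other mound-shaped data it is a rough guide.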

  • Standard Deviation =

    Square Root of Variance

    s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}

  • Sample Standard Deviation

    x         x − x-bar       (x − x-bar)²

    6         6−8 = −2        (−2)(−2) = 4

    6         6−8 = −2        4

    7         7−8 = −1        (−1)(−1) = 1

    8         8−8 = 0         0

    13        13−8 = 5        (5)(5) = 25

    Sum=40    Sum=0           Sum = 34

    Mean = 40/5 = 8

  • Standard Deviation

    Total variation = 34

    Sample variance = 34/4 = 8.5

    Sample standard deviation =

    square root of 8.5 = 2.9
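
The same steps (deviations, squared deviations, division by n − 1, square root) can be sketched in Python. This is a minimal illustration on the five values from the table; the variable names are assumptions, and `statistics.variance` is used only as a cross-check.

```python
from math import sqrt
from statistics import variance

data = [6, 6, 7, 8, 13]
n = len(data)
x_bar = sum(data) / n                               # sample mean

deviations = [x - x_bar for x in data]              # x - x_bar for each value
total_variation = sum(d ** 2 for d in deviations)   # sum of squared deviations
sample_variance = total_variation / (n - 1)         # divide by n - 1 for a sample
sample_sd = sqrt(sample_variance)

# Cross-check against the standard library's sample variance
library_variance = variance(data)

print(total_variation, sample_variance, round(sample_sd, 1))
```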

  • Measures of Variability - Example

    The hourly wages earned by a sample of five students

    are:

    $7, $5, $11, $8, and $6

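    Range: 11 − 5 = 6

    Variance:

    s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30

    Standard deviation:

    s = \sqrt{s^2} = \sqrt{5.30} = 2.30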

  • Graphical Tools

    Line chart: trend over time

    Scatter diagram: relationship between two variables

    Bar chart: frequency for each category

    Histogram: frequency for each class of measured data (graph of frequency distr.)

    Box plot: graphical display based on quartiles, which divide data into 4 parts

  • Top Ten #2

    Hypothesis Testing

  • H0: Null Hypothesis

    A statement about the value of a population

    parameter

    Population mean = μ

    Population proportion = π

    Never include a sample statistic (such as x-

    bar) in a hypothesis

  • HA or H1: Alternative Hypothesis

    ONE TAIL ALTERNATIVE

    Right tail: μ > number (smog check)

    π > fraction (% defectives)

    Left tail: μ < number

    π < fraction

  • One-Tailed Tests

    A test is one-tailed when the alternate

    hypothesis, H1 or HA, states a direction, such as:

    H1: The mean yearly salary earned by full-time

    employees is more than $45,000. (μ > $45,000)

    H1: The average speed of cars traveling on the

    freeway is less than 75 miles per hour. (μ < 75)

  • Two-Tail Alternative

    Population mean not equal to number (too

    hot or too cold)

    Population proportion not equal to fraction (%

    alcohol too weak or too strong)

  • Two-Tailed Tests

    A test is two-tailed when no direction is

    specified in the alternate hypothesis

    H1: The mean amount of time spent on the

    Internet is not equal to 5 hours. (μ ≠ 5)

    H1: The mean price for a gallon of gasoline

    is not equal to $2.54. (μ ≠ $2.54)

  • Reject Null Hypothesis (H0) If

    Absolute value of test statistic* > critical value*

    Reject H0 if |Z Value| > critical Z

    Reject H0 if | t Value| > critical t

    Reject H0 if p-value < significance level (alpha)

    Note that direction of inequality is reversed!

    Reject H0 if very large difference between sample

    statistic and population parameter in H0

    * Test statistic: A value, determined from sample information, used to determine

    whether or not to reject the null hypothesis.

    * Critical value: The dividing point between the region where the null hypothesis is

    rejected and the region where it is not rejected.

  • Example: Smog Check

    H0: μ = 80

    HA: μ > 80

    If test statistic =2.2 and critical value = 1.96,

    reject H0, and conclude that the population

    mean is likely > 80

    If test statistic = 1.6 and critical value = 1.96,

    do not reject H0, and reserve judgment about

    H0

  • Type I vs Type II Error

    Alpha = α = P(type I error) = Significance level = probability that you reject a true null hypothesis

    Beta = β = P(type II error) = probability that you do not reject the null hypothesis, given H0 is false

    Ex: H0: Defendant innocent

    α = P(jury convicts innocent person)

    β = P(jury acquits guilty person)

  • Type I vs Type II Error

                        H0 true                        H0 false

    Reject H0           Alpha = α = P(type I error)    1 − β (Correct Decision)

    Do not reject H0    1 − α (Correct Decision)       Beta = β = P(type II error)

  • Example: Smog Check

    H0: μ = 80

    HA: μ > 80

    If p-value = 0.01 and alpha = 0.05, reject H0,

    and conclude that the population mean is

    likely > 80

    If p-value = 0.07 and alpha = 0.05, do not

    reject H0, and reserve judgment about H0

  • Test Statistic

    When testing for the population mean from a

    large sample and the population standard

    deviation is known, the test statistic is given

    by:

    z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}

  • Example

    The processors of Best Mayo indicate on the

    label that the bottle contains 16 ounces of

    mayo. The standard deviation of the process

    is 0.5 ounces. A sample of 36 bottles from last

    hour's production showed a mean weight of 16.12 ounces per bottle. At the .05

    significance level, can we conclude that the

    mean amount per bottle is greater than 16

    ounces?

  • Example contd

    1. State the null and the alternative hypotheses:

    H0: μ = 16, H1: μ > 16

    2. Select the level of significance. In this case, we selected the .05 significance level.

    3. Identify the test statistic. Because we know the population standard deviation, the test statistic is z.

    4. State the decision rule.

    Reject H0 if z > 1.645 (= z0.05)

  • Example contd

    5. Compute the value of the test statistic

    z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{16.12 - 16.00}{0.5 / \sqrt{36}} = 1.44

    6. Conclusion: Do not reject the null hypothesis, because 1.44 < 1.645.

    We cannot conclude the mean is greater than 16

    ounces.
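
The Best Mayo test can be reproduced with a few lines of Python. This is a sketch under the slide's stated assumptions (known population standard deviation, right-tail alternative); `NormalDist` from the standard library stands in for the Z table, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 16.0     # value of the population mean stated in H0
sigma = 0.5     # known population standard deviation
n = 36          # sample size
x_bar = 16.12   # sample mean
alpha = 0.05    # significance level

# Test statistic: z = (x_bar - mu) / (sigma / sqrt(n))
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Right-tail critical value and p-value from the standard normal distribution
critical_z = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)

reject_h0 = z > critical_z
print(round(z, 2), round(critical_z, 3), round(p_value, 3), reject_h0)
```

Both decision rules agree here: the test statistic is below the critical value, and the p-value is above alpha.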

  • Top Ten #3

    Confidence Intervals: Mean and Proportion

  • Confidence Interval

    A confidence interval is a range of values within

    which the population parameter is expected

    to occur.

  • Factors for Confidence Interval

    The factors that determine the width of a confidence interval are:

    1. The sample size, n

    2. The variability in the population, usually estimated by standard deviation.

    3. The desired level of confidence.

  • Confidence Interval: Mean

    Use normal distribution (Z table) if:

    population standard deviation (sigma)

    known and either (1) or (2):

    (1) Normal population

    (2) Sample size > 30

  • Confidence Interval: Mean

    If normal table, then

    \bar{x} \pm z \frac{\sigma}{\sqrt{n}}, where \bar{x} = \frac{\sum x}{n}

  • Normal Table

    Tail = .5(1 − confidence level)

    NOTE! Different statistics texts have different

    normal tables

    This review uses the tail of the bell curve

    Ex: 95% confidence: tail = .5(1-.95)= .025

    Z.025 = 1.96

  • Example

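    n = 49, Σx = 490, σ = 2, 95% confidence

    \bar{x} = \frac{\sum x}{n} = \frac{490}{49} = 10

    \bar{x} \pm z \frac{\sigma}{\sqrt{n}} = 10 \pm 1.96 \frac{2}{\sqrt{49}} = 10 \pm 0.56

    9.44 < μ < 10.56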

  • Another Example

    One of the SOM professors wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours. It is assumed that the population standard deviation is 4 hours. What is the population mean?

  • Another Example contd

    95 percent confidence interval for the population mean:

    \bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}} = 24.00 \pm 1.96 \frac{4}{\sqrt{49}} = 24.00 \pm 1.12

    The confidence limits range from 22.88 to

    25.12. We estimate with 95 percent

    confidence that the average number of hours

    worked per week by students lies between

    these two values.
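
A minimal Python sketch of this interval, assuming a known population standard deviation as in the example. The function name `z_interval` is hypothetical; `NormalDist().inv_cdf` plays the role of the normal table, using the tail area .5(1 − confidence level).

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    tail = 0.5 * (1 - confidence)        # area in one tail of the bell curve
    z = NormalDist().inv_cdf(1 - tail)   # about 1.96 for 95% confidence
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hours worked per week: n = 49, sample mean 24, population SD 4
low, high = z_interval(x_bar=24, sigma=4, n=49, confidence=0.95)
print(round(low, 2), round(high, 2))
```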

  • Confidence Interval: Mean

    t distribution

    Use if normal population but population

    standard deviation (σ) not known

    If you are given the sample standard

    deviation (s), use t table, assuming normal

    population

    If one population, n-1 degrees of freedom

  • Confidence Interval: Mean

    t distribution

    \bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}

  • Confidence Interval:

    Proportion

    Use if success or failure

    (ex: defective or not-defective,

    satisfactory or unsatisfactory)

    Normal approximation to binomial ok if

    (n)(π) > 5 and (n)(1 − π) > 5, where

    n = sample size

    π = population proportion

    NOTE: NEVER use the t table if proportion!!

  • Confidence Interval:

    Proportion

    Ex: 8 defectives out of 100, so p = .08 and

    n = 100, 95% confidence

    p \pm z \sqrt{\frac{p(1 - p)}{n}} = .08 \pm 1.96 \sqrt{\frac{(.08)(.92)}{100}} = .08 \pm .05

  • Confidence Interval:

    Proportion

    A sample of 500 people who own their house

    revealed that 175 planned to sell their homes

    within five years. Develop a 98% confidence

    interval for the proportion of people who plan to

    sell their house within five years.

    p = \frac{175}{500} = 0.35

    .35 \pm 2.33 \sqrt{\frac{(.35)(.65)}{500}} = .35 \pm .0497
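
The proportion interval can be sketched the same way. The function name `proportion_margin` is an assumption made for illustration. The code takes z from `NormalDist` (about 2.326 for 98% confidence) rather than the rounded table value 2.33, so the margin differs from the slide in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

def proportion_margin(successes, n, confidence):
    """Sample proportion and margin of error, using the normal approximation."""
    p = successes / n
    tail = 0.5 * (1 - confidence)
    z = NormalDist().inv_cdf(1 - tail)
    margin = z * sqrt(p * (1 - p) / n)
    return p, margin

# 175 of 500 homeowners plan to sell within five years, 98% confidence
p, margin = proportion_margin(175, 500, 0.98)
print(round(p, 2), round(margin, 3))

# 8 defectives out of 100, 95% confidence
p_def, margin_def = proportion_margin(8, 100, 0.95)
print(round(p_def, 2), round(margin_def, 2))
```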

  • Interpretation

    If 95% confidence, then 95% of all confidence

    intervals will include the true population parameter

    NOTE! Never use the term probability when estimating a parameter!! (ex: Do NOT say

    "Probability that population mean is between 23 and 32 is .95", because a parameter is not a random variable. In fact, the population mean is a fixed but

    unknown quantity.)

  • Point vs Interval Estimate

    Point estimate: statistic (single number)

    Ex: sample mean, sample proportion

    Each sample gives different point estimate

    Interval estimate: range of values

    Ex: Population mean = sample mean ± error

    Parameter = statistic ± error

  • Width of Interval

    Ex: sample mean =23, error = 3

    Point estimate = 23

    Interval estimate = 23 ± 3, or (20,26)

    Width of interval = 26-20 = 6

    Wide interval: Point estimate unreliable

  • Wide Confidence Interval If

    (1) small sample size(n)

    (2) large standard deviation

    (3) high confidence level (ex: 99% confidence

    interval wider than 95% confidence interval)

    If you want narrow interval, you need a large

    sample size or small standard deviation or low

    confidence level.

  • Top Ten #4

    Linear Regression

  • Linear Regression

    Regression equation:

    \hat{y} = b_0 + b_1 x

    y-hat = dependent variable = predicted value

    x = independent variable

    b0 = y-intercept = predicted value of y if x = 0

    b1 = slope = regression coefficient

    = change in y per unit change in x

  • Slope vs Correlation

    Positive slope (b1>0): positive correlation

    between x and y (y increases if x increases)

    Negative slope (b1<0): negative correlation

    between x and y (y decreases if x increases)

  • Simple Linear Regression

    Simple: one independent variable, one

    dependent variable

    Linear: graph of regression equation is

    straight line

  • Example

    y = salary (female manager, in thousands of

    dollars)

    x = number of children

    n = number of observations

  • Given Data

    x y

    2 48

    1 52

    4 33

  • Totals

    x y

    2 48

    1 52

    4 33 n=3

    Sum=7 Sum=133

  • Slope (b1) = -6.5

    Method of Least Squares formulas not on

    BUS 302 exam

    b1= -6.5 given

    Interpretation: If one female manager has 1

    more child than another, her expected salary is $6,500

    lower; that is, salary of female managers

    is expected to decrease by 6.5 (in

    thousands of dollars) per child

  • Intercept (b0)

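    \bar{x} = \frac{\sum x}{n} = \frac{7}{3} = 2.33

    \bar{y} = \frac{\sum y}{n} = \frac{133}{3} = 44.33

    \bar{y} = b_0 + b_1 \bar{x}, so b_0 = \bar{y} - b_1 \bar{x}

    b0 = 44.33 − (−6.5)(2.33) = 59.5

    If number of children is zero,

    expected salary is $59,500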

  • Regression Equation

    \hat{y} = 59.5 - 6.5x

  • Forecast Salary If 3 Children

    59.5 − 6.5(3) = 40

    $40,000 = expected salary
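
The slides give the slope without deriving it. The sketch below shows one way to recover the slope, intercept, and 3-child forecast from the three data points. It uses the usual least-squares slope formula, which does not appear in the slides, so treat that step as an assumption; the variable names are illustrative.

```python
# Data from the slides: x = number of children, y = salary in thousands of dollars
x = [2, 1, 4]
y = [48, 52, 33]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope: sum of cross-deviations over sum of squared x-deviations
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx

# Intercept, from the fact that the line passes through (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

# Forecast salary for a manager with 3 children
forecast = b0 + b1 * 3
print(round(b1, 2), round(b0, 2), round(forecast, 2))
```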

  • Standard Error of Estimate

    forecast: \hat{y} = b_0 + b_1 x

    error = y - \hat{y}

    S = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}

  • Standard Error of Estimate

    (1) = x    (2) = y    (3) = y-hat = 59.5 − 6.5x    (4) = (2) − (3) = y − y-hat    (5) = (y − y-hat)²

    2          48         46.5                         1.5                            2.25

    1          52         53                           −1                             1

    4          33         33.5                         −.5                            .25

    SSE = 3.5

  • Standard Error of Estimate

    S = \sqrt{\frac{3.5}{3 - 2}} = \sqrt{3.5} = 1.9

    Actual salary typically $1,900

    away from expected salary

  • Coefficient of Determination

    R2 = % of total variation in y that can be

    explained by variation in x

    Measure of how close the linear regression

    line fits the points in a scatter diagram

    R2 = 1: max. possible value: perfect linear

    relationship between y and x (straight line)

    R2 = 0: min. value: no linear relationship

  • Sources of Variation (V)

    Total V = Explained V + Unexplained V

    SS = Sum of Squares = V

    Total SS = Regression SS + Error SS

    SST = SSR + SSE

    SSR = Explained V, SSE = Unexplained

  • Coefficient of Determination

    R2 = SSR / SST

    R2 = 197 / 200.5 = .98

    Interpretation: 98% of total variation in salary

    can be explained by variation in number of

    children
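
The sums of squares behind these numbers can be recomputed from the three data points. The sketch below takes b0 = 59.5 and b1 = −6.5 as given and derives SSE, SST, SSR, the standard error of estimate, R2, and R. Computed without rounding, SSR and SST come out near 197.17 and 200.67, slightly different from the rounded figures on the slide, while R2 still rounds to .98.

```python
from math import sqrt

x = [2, 1, 4]        # number of children
y = [48, 52, 33]     # salary in thousands of dollars
b0, b1 = 59.5, -6.5  # intercept and slope taken from the slides
n = len(x)

predicted = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / n

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                   # total variation
ssr = sst - sse                                            # explained variation

std_error = sqrt(sse / (n - 2))   # standard error of estimate
r_squared = ssr / sst             # coefficient of determination

# R carries the sign of the slope
r = -sqrt(r_squared) if b1 < 0 else sqrt(r_squared)

print(sse, round(std_error, 1), round(r_squared, 2), round(r, 4))
```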

  • 0 ≤ R2 ≤ 1

    0: No linear relationship since SSR=0

    (explained variation =0)

    1: Perfect relationship since SSR = SST

    (unexplained variation = SSE = 0), but does

    not prove cause and effect

  • R=Correlation Coefficient

    Case 1: slope (b1) < 0

    R < 0

    R is negative square root of coefficient of

    determination

    R = -\sqrt{R^2}

  • Our Example

    Slope = b1 = -6.5

    R2 = .98

    R = -.99

  • Case 2: Slope > 0

    R is positive square root of coefficient of

    determination

    Ex: R2 = .49

    R = .70

    R has no interpretation

    R overstates relationship

  • Caution

    Nonlinear relationship (parabola, hyperbola,

    etc) can NOT be measured by R2

    In fact, you could get R2=0 with a nonlinear

    graph on a scatter diagram
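
A small illustration of this caution, using made-up points that lie exactly on the parabola y = x²; the data and the helper name `correlation` are hypothetical. The relationship is perfect but not linear, and the correlation comes out as zero.

```python
from math import sqrt

def correlation(x, y):
    """Correlation coefficient R between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y depends on x exactly, but through a parabola
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r = correlation(x, y)
r_squared = r ** 2
print(r, r_squared)
```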

  • Summary: Correlation Coefficient

    Case 1: If b1 > 0, R is the positive square root of the coefficient of determination

    Ex#1: y = 4+3x, R2=.36: R = +.60

    Case 2: If b1 < 0, R is the negative square root of the coefficient of determination

    Ex#2: y = 80-10x, R2=.49: R = -.70

    NOTE! Ex#2 has stronger relationship, as measured by coefficient of determination

  • Extreme Values

    R=+1: perfect positive correlation

    R= -1: perfect negative correlation

    R=0: zero correlation

  • MS Excel Output

    Correlation Coefficient (-0.9912): Note

    that you need to change the sign because

    the sign of slope (b1) is negative (-6.5)

    Coefficient of Determination

    Standard Error of Estimate

    Regression Coefficient

  • Top Ten #5

    Expected Value

  • Expected Value

    Expected Value = E(x) = ΣxP(x)

    = x1P(x1) + x2P(x2) + …

    Expected value is a weighted average, also a

    long-run average

  • Example

    Find the expected age at high school

    graduation if 11 were 17 years old, 80 were

    18 years old, and 5 were 19 years old

    Step 1: 11+80+5=96

  • Step 2

    x P(x) x P(x)

    17 11/96=.115 17(.115)=1.955

    18 80/96=.833 18(.833)=14.994

    19 5/96=.052 19(.052)=.988

    E(x)= 17.937
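
The weighted average can be computed exactly with fractions, which avoids rounding each probability to three decimals. This is a sketch with illustrative variable names; the exact result is 17.9375, and the slide's 17.937 reflects the rounded probabilities.

```python
from fractions import Fraction

# Age at graduation -> number of students
counts = {17: 11, 18: 80, 19: 5}
total = sum(counts.values())   # Step 1: total number of students

# Step 2: P(x) = count / total, then E(x) = sum of x * P(x)
probabilities = {age: Fraction(count, total) for age, count in counts.items()}
expected_age = sum(age * p for age, p in probabilities.items())

print(total, float(expected_age))
```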

  • Top Ten #6

    What Distribution to Use?

  • Use Binomial Distribution If:

    Random variable (x) is number of successes in n

    trials

    Each trial is success or failure

    Independent trials

    Constant probability of success (π) on each trial

    Sampling with replacement (in practice, people

    may use binomial w/o replacement, but theory is

    with replacement)

  • Success vs. Failure

    The binomial experiment can result in only one of two possible outcomes:

    Male vs. Female

    Defective vs. Non-defective

    Yes or No

    Pass (8 or more right answers) vs. Fail (fewer than 8)

    Buy drink (21 or over) vs. Cannot buy drink

  • Binomial Is Discrete

    Integer values

    0, 1, 2, …, n

    Binomial is often skewed, but may be symmetric
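
To see the discreteness and the skew, the sketch below lists binomial probabilities for a hypothetical n = 4 with two success probabilities, 0.5 and 0.1. These values are chosen only for illustration and do not come from the slides. It uses `math.comb` from the standard library (Python 3.8+).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 4  # hypothetical number of trials
symmetric = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]  # success probability 0.5
skewed = [binomial_pmf(k, n, 0.1) for k in range(n + 1)]     # success probability 0.1

for k in range(n + 1):
    print(k, round(symmetric[k], 4), round(skewed[k], 4))
```

With probability 0.5 the distribution mirrors itself around the middle; with 0.1 most of the probability sits at 0 and 1 successes.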

  • Normal Distribution

    Continuous, bell-shaped, symmetric

    Mean=median=mode

    Measurement (dollars, inches, years)

    Cumulative probability under normal curve : use Z table if you know population mean and population standard deviation

    Sample mean: use Z table if you know population standard deviation and either normal population or n > 30

  • t Distribution

    Continuous, mound-shaped, symmetric

    Applications similar to normal

    More spread out than normal

    Use t if normal population but population standard deviation not known

    Degrees of freedom = df = n-1 if estimating the mean of one population

    t approaches z as df increases

  • Normal or t Distribution?

    Use t table if normal population but population

    standard deviation (σ) is not known

    If you are given the sample standard deviation

    (s), use t table, assuming normal population

  • Top Ten #7

    P-value

  • P-value

    P-value = probability of getting a sample statistic

    as extreme as (or more extreme than) the sample

    statistic you got from your sample, given that the

    null hypothesis is true

  • P-value Example: one tail test

    H0: μ = 40

    HA: μ > 40

    Sample mean = 43

    P-value = P(sample mean > 43, given H0 true)

    Meaning: probability of observing a sample

    mean as large as 43 when the population mean

    is 40

    How to use it: Reject H0 if p-value < α (significance level)

  • Two Cases

    Suppose α = .05

    Case 1: suppose p-value = .02, then reject H0 (unlikely H0 is true; you believe population mean > 40)

    Case 2: suppose p-value = .08, then do not reject H0 (H0 may be true; you have reason to believe that the population mean may be 40)

  • P-value Example: two tail test

    H0: μ = 70

    HA: μ ≠ 70

    Sample mean = 72

    If two-tails, then P-value =

    2 × P(sample mean > 72) = 2(.04) = .08

    If α = .05, p-value > α, so do not reject H0
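
The slide does not give the standard deviation or sample size behind the .04 tail area. As an illustration only, the sketch below assumes a test statistic of z = 1.75, which gives an upper-tail area of about .04, and shows how the one-tail and two-tail p-values follow from it. The function name `p_value` is hypothetical.

```python
from statistics import NormalDist

def p_value(z, two_tailed):
    """P-value for a z test statistic under the standard normal distribution."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail if two_tailed else upper_tail

z = 1.75      # hypothetical test statistic, chosen to match the .04 tail area
alpha = 0.05

one_tail = p_value(z, two_tailed=False)
two_tail = p_value(z, two_tailed=True)
reject_h0 = two_tail < alpha

print(round(one_tail, 2), round(two_tail, 2), reject_h0)
```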

  • Top Ten #8

    Variation Creates Uncertainty

  • No Variation

    Certainty, exact prediction

    Standard deviation = 0

    Variance = 0

    All data exactly same

    Example: all workers in minimum wage job

  • High Variation

    Uncertainty, unpredictable

    High standard deviation

    Ex #1: Workers in downtown L.A. have variation

    between CEOs and garment workers

    Ex #2: New York temperatures in spring range

    from below freezing to very hot

  • Comparing Standard

    Deviations

    Temperature Example

    Beach city: small standard deviation (single

    temperature reading close to mean)

    High Desert city: High standard deviation (hot

    days, cool nights in spring)

  • Standard Error of the Mean

    Standard deviation of sample mean =

    standard deviation/square root of n

    Ex: standard deviation = 10, n =4, so standard

    error of the mean = 10/2= 5

    Note that 5 < 10: the standard error of the mean is smaller than the standard deviation

  • Sampling Distribution

    Expected value of sample mean = population mean, but an individual sample mean could be smaller or larger than the population mean

    Population mean is a constant parameter, but sample mean is a random variable

    Sampling distribution is distribution of sample means

  • Example

    Mean age of all students in the building is

    population mean

    Each classroom has a sample mean

    Distribution of sample means from all

    classrooms is sampling distribution

  • Central Limit Theorem (CLT)

    If population standard deviation is known,

    sampling distribution of sample means is approximately normal

    if n > 30

    CLT applies even if original population is

    skewed
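
A small simulation can make the theorem concrete. The sketch below is an illustration under assumed settings that are not from the slides: a skewed exponential population with mean 1 and standard deviation 1, samples of size n = 36, 5,000 repetitions, and a fixed random seed. It checks that the sample means center on the population mean, spread out by about sigma divided by the square root of n, and roughly follow the 68% part of the empirical rule.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(302)    # fixed seed so the run is repeatable

n = 36              # size of each sample (assumed)
num_samples = 5000  # number of samples drawn (assumed)

# Exponential population with rate 1: mean = 1, SD = 1, strongly right-skewed
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

center = mean(sample_means)   # should be close to the population mean, 1
spread = stdev(sample_means)  # should be close to the standard error
standard_error = 1 / sqrt(n)  # sigma / sqrt(n)

# Share of sample means within one standard error of the population mean
within_one_se = sum(abs(m - 1) <= standard_error for m in sample_means) / num_samples

print(round(center, 3), round(spread, 3), round(standard_error, 3), round(within_one_se, 3))
```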

  • Top Ten #9

    Population vs. Sample

  • Population

    Collection of all items (all light bulbs made at

    factory)

    Parameter: measure of population

    (1) population mean (average number of

    hours in life of all bulbs)

    (2) population proportion (% of all bulbs that

    are defective)

  • Sample

    Part of population (bulbs tested by inspector)

    Statistic: measure of sample = estimate of parameter

    (1) sample mean (average number of hours in life of bulbs tested by inspector)

    (2) sample proportion (% of bulbs in sample that are defective)

  • Top Ten #10

    Qualitative vs. Quantitative

  • Qualitative

    Categorical data:

    success vs. failure

    ethnicity

    marital status

    color

    zip code

    4 star hotel in tour guide

  • Qualitative

    If you need an average, do not calculate the mean

    However, you can compute the mode

    (average person is married, buys a blue car made in America)

  • Quantitative

    Two cases

    Case 1: discrete

    Case 2: continuous

  • Discrete

    (1) integer values (0, 1, 2, …)

    (2) example: binomial

    (3) finite number of possible values

    (4) counting

    (5) number of brothers

    (6) number of cars arriving at gas station

  • Continuous

    Real numbers, such as decimal values

    ($22.22)

    Examples: Z, t

    Infinite number of possible values

    Measurement

    Miles per gallon, distance, duration of time

  • Graphical Tools

    Pie chart or bar chart: qualitative

    Joint frequency table: qualitative (relate

    marital status vs zip code)

    Scatter diagram: quantitative (distance from

    CSUN vs duration of time to reach CSUN)

  • Hypothesis Testing

    Confidence Intervals

    Quantitative: Mean

    Qualitative: Proportion