Week 1: Probability and Sampling - University of …

Post on 24-Mar-2020


  • Statistics Bootcamp

    Professor P. K. Kannan

    1

  • Agenda

    1. Who am I?

    2. Overview

    3. Session 1: Probability and sampling

    4. Session 2 – 1: Distributions and Hypothesis Testing

    5. Session 2 – 2: Regression

    2

  • Part I: Who am I?

    3

  • My Background

    • Dean’s Chair in Marketing Science, Professor of Marketing
    • Ph.D. in Management Science, OR and Statistics
    • Research in marketing analytics
    • Have taught variations of this course in MBA programs/Masters in Marketing Analytics
    • Teach Customer Equity Management and Digital Marketing in the MBA program and in online mode

    4

  • Part II: Bootcamp Overview

    5

  • What we do…

    • This bootcamp will emphasize concepts
    • Visualize and summarize data
    • Develop data intuition
    • Generate insights
    • The best way to learn is to practice!
    • I have a couple of problem sets you can use to practice
    • I am using PPT slides from a standard business statistics textbook

    6

  • Part III: Probabilities and Sampling

    7

    • Random variables: definition and examples
    • Summarizing variables: mean and standard deviation
    • Normal distribution and calculating probabilities
    • Central limit theorem (CLT) and its applications
    • Sampling methods

  • Random variables

    • A random variable denotes the possible outcomes of a random phenomenon or event

    • A random variable can be discrete OR continuous
    • Discrete random variables
      • take a fixed set of possible outcomes
      • have a probability associated with each outcome
    • Continuous random variables
      • take any value within a range

    8

  • Examples of random variables

    #  Variable                                      Discrete  Continuous
    1  Customer gender
    2  Customer age
    3  Total annual income of a household
    4  Monthly subscription fee of a fitness center
    5  Veteran status
    6  Residential state

    9

  • Examples of random variables

    #  Variable                                      Discrete  Continuous
    1  Customer gender                                  X
    2  Customer age                                                X
    3  Total annual income of a household                          X
    4  Monthly subscription fee of a fitness center                X
    5  Veteran status                                   X
    6  Residential state                                X

    10

    You may convert a continuous variable into a discrete one, for example, by creating the categorical variable “age group” from the continuous variable “age”
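
    As a rough illustration of discretizing, the sketch below maps ages to age groups. The cut points and labels are assumptions made up for this example; the slides do not specify any.

```python
from bisect import bisect_right

# Hypothetical cut points and labels (not from the slides).
AGE_EDGES = [18, 35, 50, 65]
AGE_LABELS = ["under 18", "18-34", "35-49", "50-64", "65+"]


def age_group(age: float) -> str:
    """Map a continuous age to a discrete age-group label."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [16.5, 18.0, 34.9, 42.3, 65.0, 80.2]
groups = [age_group(a) for a in ages]
print(groups)
```

    The continuous variable “age” can take any value in a range, while “age group” takes one of only five possible outcomes.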

  • Summarizing variables: mean and standard deviation

    • First, let us get familiar with what a variable looks like in Excel
    • Suppose we are interested in knowing how many days per week American adults go to a gym to work out
    • Let us denote the variable as “X”
    • X would take the values of 0, 1, 2, 3, 4, 5, 6, and 7
    • Now, suppose we surveyed a representative sample of adults and received the data in an Excel spreadsheet (as shown on the next page)

    11

  • What the data could look like

    • If data are entered individual-by-individual:

    • If data are entered in aggregates:

    12

  • If data are entered individual-by-individual:

    13

    Mean calculation:

    $$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

    Or, the average() function in Excel

    Variance calculation:

    $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \quad \text{where } \bar{x} = \text{sample mean}$$

    Standard deviation = square root of variance

    Or, the stdev.s() function in Excel
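
    The same calculation can be sketched outside Excel. The survey responses below are hypothetical values invented for illustration; Python's statistics.stdev uses the n - 1 denominator, so it plays the role of stdev.s().

```python
import math
import statistics

# Hypothetical responses: days per week each surveyed adult goes to the gym.
days = [0, 2, 3, 0, 5, 1, 7, 2]

n = len(days)
mean = sum(days) / n                                      # sample mean
variance = sum((x - mean) ** 2 for x in days) / (n - 1)   # sample variance
std_dev = math.sqrt(variance)                             # sample standard deviation

# Cross-check against the standard library (same n - 1 convention).
assert math.isclose(mean, statistics.mean(days))
assert math.isclose(std_dev, statistics.stdev(days))

print(f"mean={mean:.3f} variance={variance:.3f} std_dev={std_dev:.3f}")
```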

  • If data are entered in aggregates:

    14

    Mean calculation:

    $$\bar{x} = \sum_{i=0}^{7} x_i p_i = 0 \cdot (0.293) + 1 \cdot (0.230) + \dots + 7 \cdot (0.047) = 2.3330$$

    Variance calculation:

    $$s^2 = \sum_{i=0}^{7} (x_i - \bar{x})^2 \cdot p_i = (0 - 2.333)^2 \cdot (0.293) + (1 - 2.333)^2 \cdot (0.230) + \dots + (7 - 2.333)^2 \cdot (0.047) = 5.4760$$

    Standard deviation = square root of variance = 2.34
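
    A minimal sketch of the aggregate calculation follows. The slide shows only some of the eight proportions, so the values and proportions here are hypothetical stand-ins chosen to keep the arithmetic easy to follow; the function name is also an assumption.

```python
import math


def weighted_mean_and_variance(values, proportions):
    """Mean and variance when data come as (value, proportion) pairs."""
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError("proportions must sum to 1")
    mean = sum(x * p for x, p in zip(values, proportions))
    variance = sum((x - mean) ** 2 * p for x, p in zip(values, proportions))
    return mean, variance


# Hypothetical aggregate table (not the survey data from the slide).
values = [0, 1, 2, 3]
proportions = [0.4, 0.3, 0.2, 0.1]

mean, variance = weighted_mean_and_variance(values, proportions)
std_dev = math.sqrt(variance)
print(f"mean={mean:.3f} variance={variance:.3f} std_dev={std_dev:.3f}")
```

    In a spreadsheet, the same idea amounts to multiplying the value column by the proportion column and summing the products.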

  • Can you think of how to do the calculation in Excel?

    15

  • Some additional stuff about random variables

    • Suppose you take repeated independent draws from the distribution of a random variable X; assume X ~ (Mean 2, Std. Dev 3)
    • Sum of 3 draws: X1+X2+X3 (Think of this as three different stocks you own in your portfolio, each with the same mean return of 2 and std. dev of 3. What is the mean total (sum) return on these stocks? It is the sum of the mean returns on each stock, as shown below.)
    • What is the mean (expected value) of this sum?
    • E(X1+X2+X3) = E(X1) + E(X2) + E(X3) = 3*E(X) = 3*2 = 6
    (What is the standard deviation of the sum of the returns on these stocks? It is the square root of the variance of the sum, as shown below.)
    • What is the variance, V, of this sum?
    • V(X1+X2+X3) = V(X1) + V(X2) + V(X3) = 3*V(X) = 3*9 = 27
    • Standard deviation = square root of 27 = 5.20
    (We assume that the stocks’ returns are independent of each other to calculate the above variance and std. dev.)
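
    The rules for the sum can be checked numerically. The sketch below computes the mean and standard deviation of the sum directly, then compares them with a simulation. Drawing the returns from a normal distribution is an assumption made only for the simulation; the rules themselves need independence, not normality.

```python
import math
import random
import statistics

MEAN, STD_DEV, N_DRAWS = 2.0, 3.0, 3

# Direct calculation for the sum of independent draws.
mean_of_sum = N_DRAWS * MEAN                 # 3 * 2
variance_of_sum = N_DRAWS * STD_DEV ** 2     # 3 * 9
std_dev_of_sum = math.sqrt(variance_of_sum)

# Simulation check (normal draws are an illustrative assumption).
rng = random.Random(0)
sums = [
    sum(rng.gauss(MEAN, STD_DEV) for _ in range(N_DRAWS))
    for _ in range(50_000)
]
sim_mean = statistics.fmean(sums)
sim_std_dev = statistics.stdev(sums)

print(f"theory: mean={mean_of_sum:.2f} sd={std_dev_of_sum:.2f}")
print(f"simulated: mean={sim_mean:.2f} sd={sim_std_dev:.2f}")
```

    Note that the standard deviations do not add: 3 + 3 + 3 = 9 would overstate the risk of the portfolio, because independent returns partly offset each other.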

    16

  • Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 2-17

    Frequency Distribution: Continuous Data

    • Continuous Data: may take on any value in some interval

    Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature

    24, 35, 17, 21, 24, 37, 26, 46, 58, 30,

    32, 13, 12, 38, 41, 43, 44, 27, 53, 27

    (Temperature is a continuous variable because it could be measured to any degree of precision desired)

  • Grouping Data by Classes

    • Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
    • Find range: 58 - 12 = 46
    • Select number of classes: 5 (usually between 5 and 20)
    • Compute class width: 10 (46/5 = 9.2, then round up)
    • Determine class boundaries: 10, 20, 30, 40, 50
    • Compute class midpoints: 15, 25, 35, 45, 55
    • Count observations and assign to classes
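
    The grouping steps above can be sketched in a few lines. The data are the 20 temperatures from the example; rounding the width up with math.ceil and starting the first class at a multiple of the width are choices made for this sketch, and other data may need an extra class at the top.

```python
import math

temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
NUM_CLASSES = 5

data_range = max(temperatures) - min(temperatures)        # 58 - 12
class_width = math.ceil(data_range / NUM_CLASSES)         # 9.2 rounded up
first_boundary = (min(temperatures) // class_width) * class_width

lower_boundaries = [first_boundary + i * class_width for i in range(NUM_CLASSES)]
midpoints = [lower + class_width / 2 for lower in lower_boundaries]

# Count observations: each class is "lower but under lower + width".
frequencies = [0] * NUM_CLASSES
for t in temperatures:
    frequencies[(t - first_boundary) // class_width] += 1

relative_frequencies = [f / len(temperatures) for f in frequencies]

for lower, f, rel in zip(lower_boundaries, frequencies, relative_frequencies):
    print(f"{lower} but under {lower + class_width}: {f} ({rel:.2f})")
```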

  • Frequency Distribution Example

    Data in ordered array:

    12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

    Frequency Distribution

    Class              Frequency   Relative Frequency
    10 but under 20        3             .15
    20 but under 30        6             .30
    30 but under 40        5             .25
    40 but under 50        4             .20
    50 but under 60        2             .10
    Total                 20            1.00

  • Histograms

    • The classes or intervals are shown on the horizontal axis

    • Frequency is measured on the vertical axis

    • Bars of the appropriate heights can be used to represent the number of observations within each class

    • Such a graph is called a histogram
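
    A quick text rendering can stand in for a chart. The sketch below uses the class midpoints and frequencies from the temperature example and draws the bars sideways, so the classes run down the page rather than along the horizontal axis. The helper name is an assumption for this illustration.

```python
midpoints = [15, 25, 35, 45, 55]
frequencies = [3, 6, 5, 4, 2]


def text_histogram(midpoints, frequencies):
    """Return one line per class: midpoint, a bar of '#' marks, and the count."""
    return [
        f"{m:>3} | {'#' * f} ({f})"
        for m, f in zip(midpoints, frequencies)
    ]


lines = text_histogram(midpoints, frequencies)
print("\n".join(lines))
```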

  • Histogram Example

    Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

    [Figure: histogram with class midpoints (5, 15, 25, 35, 45, 55) on the horizontal axis and frequency on the vertical axis; bar heights are 0, 3, 6, 5, 4, 2]

    No gaps between bars, since continuous data

  • Normal distribution

    • The normal distribution plays a central role in statistics

    • A normal distribution is determined by two parameters: mean (denoted as m) and standard deviation (denoted as s)

    • If a random variable X follows a normal distribution with mean m and standard deviation s, we write it as X ~ Normal(m, s)

    • The normal distribution has the familiar bell shape, symmetrical around the mean

    • With large samples, the normal distribution can be used to approximate a large variety of distributions

    22

  • Histogram (or density plot) for Normal(m,s)

    23

    The distribution is centered around the mean, which is denoted as m

    The shape of the distribution is determined by the standard deviation, denoted as s

  • Examples of normal distributions

    24

  • Probability is the area under the curve

    25

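    [Figure: normal curve with the points -a and a marked; the left tail is labeled Prob(X < -a) and the right tail is labeled as below]

    $$\text{Prob}(X > a) = 1 - \text{Prob}(X \le a)$$

    It should be obvious to you that

    $$P(X < 0) = P(X > 0) = 0.5$$

    $$P(-\infty < X < \infty) = 1$$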
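
    These identities can be checked with Python's statistics.NormalDist. The sketch assumes a standard normal distribution (mean 0, standard deviation 1), which matches P(X < 0) = 0.5 on the slide; the value of a is arbitrary.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)   # assumed: mean 0, std. dev 1
a = 1.5                                      # arbitrary illustrative cutoff

p_left_tail = std_normal.cdf(-a)             # Prob(X < -a)
p_right_tail = 1 - std_normal.cdf(a)         # Prob(X > a) = 1 - Prob(X <= a)
p_below_zero = std_normal.cdf(0)             # Prob(X < 0)

print(f"Prob(X < -a) = {p_left_tail:.4f}")
print(f"Prob(X > a)  = {p_right_tail:.4f}")
print(f"Prob(X < 0)  = {p_below_zero:.4f}")
```

    Because the curve is symmetric around the mean, the two tail areas agree.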

  • Illustration: calculating probability (1)

    26

    The area under the curve:Prob(X

  • Illustration: calculating probability (2)

    27

    The area under the curve:Prob(X>200)=1-Prob(X

  • Illustration: calculating probability (3)

    28

    The area under the curve:Prob(50

  • Illustration: calculating probability (4)

    29

    The area under the curve:Prob(X

  • A Little Problem with Norm.Inv

    • Entry to a certain University is determined by a national test. The scores on this test are normally distributed with a mean of 500 and a standard deviation of 100. Tom wants to be admitted to this university and he knows that he must score better than at least 70% of the students who took the test. Tom takes the test and scores 585. Will he be admitted to this university?

    • X ~ N(500, 100)
    • Norm.Inv(0.70,500,100) = 552.44
    • Tom gets 585, which is > 552.44, so he will be admitted
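
    For readers working outside Excel, the same lookup can be sketched with statistics.NormalDist, whose inv_cdf method plays the role of Norm.Inv here. The variable names are chosen for this example.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)   # test scores: mean 500, std. dev 100

cutoff = scores.inv_cdf(0.70)            # score that beats 70% of test takers
tom_score = 585
tom_percentile = scores.cdf(tom_score)   # share of test takers Tom outscored
admitted = tom_score > cutoff

print(f"cutoff = {cutoff:.2f}")
print(f"Tom outscored {tom_percentile:.1%} of test takers; admitted = {admitted}")
```

    Going the other way (from Tom's score to a percentile) gives the same answer: he outscored roughly 80% of test takers, which is above the 70% requirement.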

    30

  • Population versus sample

    • Population: all members of a defined group that we are interested in studying

    • Census: data collected from all members of the population

    • Sample: a part of the population

    31

    Population mean: $\mu$        Sample mean: $\bar{x}$

    We say $\bar{x}$ is an estimate for $\mu$

  • Central Limit Theorem

    • Suppose we would like to use a sample (with sample size n) to estimate the population mean
    • Naturally, different samples (with the same sample size) will yield different means
    • If we draw a large number of samples, their means should follow a distribution, called “the sampling distribution of the mean”
    • CLT: the means should roughly follow a normal distribution. If the original variable X has a mean of m and a standard deviation of s, the sampling distribution is:

    $$\bar{x} \sim \text{Normal}\left(m, \frac{s}{\sqrt{n}}\right)$$

    32

  • Example: sampling distribution

    • We are interested in measuring the number of hours people spend on social media per week

    • Suppose the true population mean is 18.5 hours per week and the standard deviation is 5.8 hours

    • If we draw a sample of 100 people, the mean social media hours follows the distribution:

    $$\bar{x} \sim \text{Normal}(18.5, 0.58)$$

    • If we draw a sample of 250 people, the mean social media hours follows the distribution:

    $$\bar{x} \sim \text{Normal}(18.5, 0.37)$$

    33
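
    The standard deviations 0.58 and 0.37 come from dividing 5.8 by the square root of the sample size. The sketch below reproduces them and then runs a small simulation. The population is assumed to be uniform with the stated mean and standard deviation, a deliberately non-normal choice made only to illustrate the CLT.

```python
import math
import random
import statistics

POP_MEAN, POP_SD = 18.5, 5.8


def standard_error(sd: float, n: int) -> float:
    """Standard deviation of the sample mean for sample size n."""
    return sd / math.sqrt(n)


se_100 = standard_error(POP_SD, 100)
se_250 = standard_error(POP_SD, 250)

# Assumed population: uniform on [mean - h, mean + h], whose sd is h / sqrt(3).
half_width = POP_SD * math.sqrt(3)
rng = random.Random(42)


def sample_mean(n: int) -> float:
    draws = [rng.uniform(POP_MEAN - half_width, POP_MEAN + half_width)
             for _ in range(n)]
    return statistics.fmean(draws)


means = [sample_mean(100) for _ in range(2000)]
sim_mean = statistics.fmean(means)
sim_se = statistics.stdev(means)

print(f"theory: se(100)={se_100:.2f} se(250)={se_250:.2f}")
print(f"simulated (n=100): mean={sim_mean:.2f} se={sim_se:.2f}")
```

    Even though individual observations are uniform rather than bell-shaped, the simulated sample means cluster around 18.5 with a spread close to 0.58.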

  • Sampling methods

    Simple random sampling

    Stratified sampling

    Systematic sampling

    Cluster sampling

    Judgment sampling

    34

  • Simple Random Sampling

    • Every unit in the population has equal probability of being selected

    • For example, if the population size is 10,000 and the sample size is 200, the probability of selection is 2 percent

    • The purest form of probability sampling

    • Disadvantages
      • Subject to random sampling error
      • May be costly to implement
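
    A simple random sample is easy to sketch with the standard library. The population here is a hypothetical list of unit IDs; random.sample draws without replacement, so every unit has the same chance of selection.

```python
import random

POPULATION_SIZE = 10_000
SAMPLE_SIZE = 200

population = list(range(POPULATION_SIZE))   # hypothetical unit IDs
rng = random.Random(1)                      # fixed seed for reproducibility

sample = rng.sample(population, SAMPLE_SIZE)
selection_probability = SAMPLE_SIZE / POPULATION_SIZE

print(f"sampled {len(sample)} units; selection probability = {selection_probability:.0%}")
```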

    35

  • Stratified Sampling

    • The population is divided into mutually exclusive subsets according to some important characteristics

    • For example, gender is important for the question of interest; women make up 60% of the population and men make up 40%

    • A stratified sample reflects the 60%-40% split of women and men

    • Less sampling error than simple random sampling

    36

  • Example: stratified sampling

    37

    [Figure: the total sample of 200 is split into two strata, 120 women (200 × 60%) and 80 men (200 × 40%); simple random sampling is applied within each stratum]

    • Suppose that gender is important for the question of interest

    • We know that women make up 60% of the population and men make up 40%

    • If the total sample size is 200, a stratified sample would consist of 120 women and 80 men
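
    A minimal sketch of proportional stratified sampling follows. The population of 6,000 women and 4,000 men is hypothetical, chosen to match the 60%-40% split, and the function name is an assumption for this example.

```python
import random


def stratified_sample(population, total_size, rng):
    """Draw a simple random sample within each stratum, sized by its share."""
    strata = {}
    for label, unit_id in population:
        strata.setdefault(label, []).append((label, unit_id))

    sample = []
    for units in strata.values():
        # Proportional allocation; rounding may need adjusting for other splits.
        k = round(total_size * len(units) / len(population))
        sample.extend(rng.sample(units, k))
    return sample


# Hypothetical population: 60% women, 40% men.
population = [("woman", i) for i in range(6000)] + [("man", i) for i in range(4000)]
rng = random.Random(7)

sample = stratified_sample(population, total_size=200, rng=rng)
n_women = sum(1 for label, _ in sample if label == "woman")
n_men = sum(1 for label, _ in sample if label == "man")
print(f"women={n_women} men={n_men}")
```

    A simple random sample of 200 would hit 120 and 80 only on average; the stratified design fixes the split exactly.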

  • Systematic Sampling

    • For example, select every 10th person entering the mall to interview
    • Or, check a stock price on the first day of each month for two years

    Cluster Sampling

    • The units should form clusters, for example, students from the same school, customers shopping at the same store
    • The sampling selection is done at the cluster level following a two-step process
      step 1: randomly select the clusters (schools, stores, etc.)
      step 2: choose all the units from the selected clusters
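
    Both methods can be sketched briefly. The visitor IDs, school names, and function names below are hypothetical and serve only to illustrate the two selection rules.

```python
import random


def systematic_sample(units, step, start=0):
    """Take every `step`-th unit, beginning at index `start`."""
    return units[start::step]


def cluster_sample(clusters, n_clusters, rng):
    """Step 1: randomly select clusters. Step 2: keep all units in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


# Systematic: every 10th person entering the mall (IDs in arrival order).
visitors = list(range(1, 101))
every_10th = systematic_sample(visitors, step=10, start=9)

# Cluster: pick 2 of 4 hypothetical schools and take all of their students.
schools = {
    "School A": ["a1", "a2", "a3"],
    "School B": ["b1", "b2"],
    "School C": ["c1", "c2", "c3", "c4"],
    "School D": ["d1"],
}
rng = random.Random(3)
picked = cluster_sample(schools, n_clusters=2, rng=rng)

print(every_10th)
print(picked)
```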

    38
