week 1: probability and sampling - university of …...my background • dean’s chair in marketing...

Statistics Bootcamp

Professor P. K. Kannan


Agenda

1. Who am I?

2. Overview

3. Session 1: Probability and sampling

4. Session 2 – 1: Distributions and Hypothesis Testing

5. Session 2 – 2: Regression


Part I: Who am I?


My Background

• Dean’s Chair in Marketing Science, Professor of Marketing
• Ph.D. in Management Science, OR and Statistics
• Research in marketing analytics
• Have taught variations of this course in MBA programs/Masters in Marketing Analytics
• Teach Customer Equity Management, Digital Marketing in MBA program and in online mode

Part II: Bootcamp Overview


What we do…

• This bootcamp will emphasize concepts
• Visualize and summarize data
• Develop data intuition
• Generate insights

• The best way to learn is to practice!
• I have a couple of problem sets you can use to practice
• I am using PPT slides from a standard business statistics textbook

Part III: Probabilities and Sampling

• Random variables: definition and examples
• Summarizing variables: mean and standard deviation
• Normal distribution and calculating probabilities
• Central limit theorem (CLT) and its applications
• Sampling methods

Random variables

• A random variable denotes the possible outcomes of a random phenomenon or event

• A random variable can be discrete OR continuous

• Discrete random variables
  • take a fixed set of possible outcomes
  • have a probability associated with each outcome

• Continuous random variables
  • take any value within a range

Examples of random variables

Classify each variable as discrete or continuous:

1. Customer gender
2. Customer age
3. Total annual income of a household
4. Monthly subscription fee of a fitness center
5. Veteran status
6. Residential state

Examples of random variables

Variable                                           Discrete   Continuous
1. Customer gender                                    X
2. Customer age                                                   X
3. Total annual income of a household                             X
4. Monthly subscription fee of a fitness center                   X
5. Veteran status                                     X
6. Residential state                                  X

You may discretize a continuous variable, for example, by creating a categorical variable “age group” from the continuous variable “age”
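As a minimal sketch of discretizing age into age groups: the cut points and labels below are hypothetical, chosen only for illustration, and are not from the slides.

```python
from bisect import bisect_right

# Hypothetical cut points and labels (assumptions, not from the slides).
CUTS = [30, 45, 65]
LABELS = ["under 30", "30-44", "45-64", "65+"]

def age_group(age):
    """Map a continuous age to a discrete (categorical) age group."""
    return LABELS[bisect_right(CUTS, age)]

ages = [22.5, 30.0, 44.9, 67.2]
groups = [age_group(a) for a in ages]
print(groups)
```

Each age lands in exactly one group, so the new variable has a fixed set of possible outcomes.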

Summarizing variables: mean and standard deviation

• First, let us get familiar with what a variable looks like in Excel
• Suppose we are interested in knowing how many days per week American adults go to a gym to work out
• Let us denote the variable as “X”
• X would take the values of 0, 1, 2, 3, 4, 5, 6, and 7
• Now, suppose we surveyed a representative sample of adults and received the data in an Excel spreadsheet (as shown on the next page)

What the data could look like

• If data are entered individual-by-individual:

• If data are entered in aggregates:


If data are entered individual-by-individual:

Mean calculation:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Or, the average() function in Excel

Variance calculation:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \quad \text{where } \bar{x} = \text{sample mean}$$

Standard deviation = square root of variance

Or, the stdev.s() function in Excel
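For readers working outside Excel, here is a rough Python sketch of the same individual-by-individual calculation. The ten responses are hypothetical, not the survey data from the slides.

```python
import statistics

# Hypothetical responses: gym days per week for ten adults (not slide data).
days = [0, 0, 1, 2, 2, 3, 3, 4, 5, 7]

mean = statistics.mean(days)          # plays the role of average() in Excel
variance = statistics.variance(days)  # sample variance, divides by n - 1
std_dev = statistics.stdev(days)      # plays the role of stdev.s() in Excel

print(mean, variance, round(std_dev, 2))
```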

If data are entered in aggregates:

Mean calculation (p_i is the proportion of respondents reporting x_i days):

$$\bar{x} = \sum_{i=0}^{7} x_i \cdot p_i = 0 \cdot (0.293) + 1 \cdot (0.230) + \dots + 7 \cdot (0.047) = 2.3330$$

Variance calculation:

$$s^2 = \sum_{i=0}^{7} (x_i - \bar{x})^2 \cdot p_i = (0 - 2.333)^2 \cdot (0.293) + (1 - 2.333)^2 \cdot (0.230) + \dots + (7 - 2.333)^2 \cdot (0.047) = 5.4760$$

Standard deviation = square root of variance = 2.34

Can you think of how to do the calculation in Excel?
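One possible approach, sketched in Python rather than Excel: multiply each value by its proportion and add up the products (in Excel, SUMPRODUCT follows the same idea). The slides show only some of the proportions, so the ones below are hypothetical and will not reproduce 2.333 or 5.476.

```python
# Hypothetical proportions for X = 0..7; they sum to 1 but are NOT the
# slide's numbers, which are only partly shown.
values = [0, 1, 2, 3, 4, 5, 6, 7]
probs = [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]

# Weighted mean: sum of x_i * p_i
mean = sum(x * p for x, p in zip(values, probs))

# Weighted variance: sum of (x_i - mean)^2 * p_i
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
std_dev = variance ** 0.5

print(round(mean, 3), round(variance, 3), round(std_dev, 3))
```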


Some additional stuff about random variables

• Suppose you take repeated independent draws from the distribution of a random variable X; assume X ~ (Mean 2, Std. Dev. 3)
• Sum of 3 draws: X1+X2+X3. Think of this as three different stocks you own in your portfolio, each with the same mean return of 2 and standard deviation of 3
• What is the mean (expected value) of this sum? It is the sum of the mean returns on each stock:
  E(X1+X2+X3) = E(X1) + E(X2) + E(X3) = 3*E(X) = 3*2 = 6
• What is the variance, V, of this sum?
  V(X1+X2+X3) = V(X1) + V(X2) + V(X3) = 3*V(X) = 3*9 = 27
• What is the standard deviation of the sum? It is the square root of the variance of the sum: square root of 27 ≈ 5.20
• The variance and standard deviation above assume that the stocks’ returns are independent of each other
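The portfolio arithmetic can be checked with a short Python sketch. The function name is made up for illustration, and the calculation is valid only under the independence assumption stated above.

```python
import math

def sum_of_independent_draws(mean, std_dev, k):
    """Mean and standard deviation of the sum of k independent draws."""
    total_mean = k * mean              # means always add
    total_variance = k * std_dev ** 2  # variances add only under independence
    return total_mean, math.sqrt(total_variance)

total_mean, total_sd = sum_of_independent_draws(mean=2, std_dev=3, k=3)
print(total_mean, round(total_sd, 2))
```

Note that the standard deviation of the sum is 3 times the square root of 3, not 3 times 3: standard deviations do not add, variances do.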


Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 2-17

Frequency Distribution: Continuous Data

• Continuous Data: may take on any value in some interval

Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature

24, 35, 17, 21, 24, 37, 26, 46, 58, 30,

32, 13, 12, 38, 41, 43, 44, 27, 53, 27

(Temperature is a continuous variable because it could be measured to any degree of precision desired)


Grouping Data by Classes

• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 20)
• Compute class width: 10 (46/5 = 9.2, then round up)
• Determine class boundaries: 10, 20, 30, 40, 50
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations & assign to classes
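The grouping steps above can be sketched in Python as follows. The variable names are illustrative; the data, class width, and boundaries are the ones from the slide.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

lower_limits = [10, 20, 30, 40, 50]
width = 10

# Count observations in each class "lower but under lower + width".
freq = {lower: 0 for lower in lower_limits}
for t in temps:
    lower = (t - lower_limits[0]) // width * width + lower_limits[0]
    freq[lower] += 1

rel_freq = {lower: count / len(temps) for lower, count in freq.items()}
midpoints = [lower + width / 2 for lower in lower_limits]

for lower in lower_limits:
    print(f"{lower} but under {lower + width}: {freq[lower]}  {rel_freq[lower]:.2f}")
```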


Frequency Distribution Example

Data in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Frequency Distribution

Class              Frequency   Relative Frequency
10 but under 20        3             .15
20 but under 30        6             .30
30 but under 40        5             .25
40 but under 50        4             .20
50 but under 60        2             .10
Total                 20            1.00


Histograms

• The classes or intervals are shown on the horizontal axis

• Frequency is measured on the vertical axis

• Bars of the appropriate heights can be used to represent the number of observations within each class

• Such a graph is called a histogram


Histogram Example

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: histogram of the temperature data. Horizontal axis: class midpoints (5, 15, 25, 35, 45, 55, More). Vertical axis: frequency (0, 3, 6, 5, 4, 2, 0).]

No gaps between bars, since continuous data

Normal distribution

• The normal distribution plays a central role in statistics

• A normal distribution is determined by two parameters: mean (denoted as m) and standard deviation (denoted as s)

• If a random variable X follows a normal distribution with mean m and standard deviation s, we write it as X ~ Normal(m, s)

• The normal distribution has the familiar bell shape, symmetrical around the mean

• With large samples, the normal distribution can be used to approximate a large variety of distributions


Histogram (or density plot) for Normal(m,s)


The distribution is centered around the mean, which is denoted as m

The shape of the distribution is determined by the standard deviation, denoted as s

Examples of normal distributions


Probability is the area under the curve

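[Figure: normal curve with -a and a marked on the horizontal axis]

$$\text{Prob}(X < -a), \qquad \text{Prob}(X > a) = 1 - \text{Prob}(X \le a)$$

It should be obvious to you that, for a normal distribution centered at 0:

$$P(X < 0) = P(X > 0) = 0.5, \qquad P(-\infty < X < \infty) = 1$$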
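These area relationships can be checked numerically. The sketch below uses Python's `statistics.NormalDist`; the standard normal (mean 0, standard deviation 1) and the cutoff a = 1.5 are assumptions chosen for illustration.

```python
from statistics import NormalDist

# Standard normal, chosen for illustration (an assumption, not from the slides).
X = NormalDist(mu=0, sigma=1)
a = 1.5  # hypothetical cutoff

p_below_minus_a = X.cdf(-a)  # Prob(X < -a), area in the left tail
p_above_a = 1 - X.cdf(a)     # Prob(X > a) = 1 - Prob(X <= a), right tail
p_below_zero = X.cdf(0)      # half of the area lies below the mean

print(round(p_below_minus_a, 4), round(p_above_a, 4), p_below_zero)
```

By symmetry the two tail areas match, which is why a table or function for Prob(X ≤ a) is enough to get every other probability.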

Illustration: calculating probability (1)


The area under the curve:Prob(X



The area under the curve:Prob(X>200)=1-Prob(X



The area under the curve:Prob(50



The area under the curve:Prob(X

A Little Problem with Norm.Inv

• Entry to a certain University is determined by a national test. The scores on this test are normally distributed with a mean of 500 and a standard deviation of 100. Tom wants to be admitted to this university and he knows that he must score better than at least 70% of the students who took the test. Tom takes the test and scores 585. Will he be admitted to this university?

• X ~ N(500, 100)
• Norm.Inv(0.70,500,100) = 552.44
• Tom gets 585, which is > 552.44, so he will be admitted
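The same check can be sketched in Python, where `NormalDist.inv_cdf` plays the role of Norm.Inv. The variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

cutoff = scores.inv_cdf(0.70)     # counterpart of Norm.Inv(0.70, 500, 100)
tom_percentile = scores.cdf(585)  # share of test takers scoring below Tom
admitted = 585 > cutoff

print(round(cutoff, 2), round(tom_percentile, 3), admitted)
```

Going the other way, Tom's score of 585 sits above roughly 80% of test takers, comfortably past the 70% requirement.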


Population versus sample

• Population: all members of a defined group that we are interested in studying

• Census: data collected from all members of the population

• Sample: a part of the population

Population mean: $\mu$     Sample mean: $\bar{x}$

We say $\bar{x}$ is an estimate for $\mu$

Central Limit Theorem

• Suppose we would like to use a sample (with sample size n) to estimate the population mean
• Naturally, different samples (with the same sample size) will yield different means
• If we draw a large number of samples, their means should follow a distribution, called “the sampling distribution of the mean”
• CLT: for a large enough sample size, the means should roughly follow a normal distribution

If the original variable X has a mean of m and a standard deviation of s, the sampling distribution is:

$$\bar{x} \sim \text{Normal}\left(m, \frac{s}{\sqrt{n}}\right)$$
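A small simulation can make the CLT concrete. This is only a sketch under assumed settings: the exponential distribution (mean 1, standard deviation 1), the sample size of 50, and the 5,000 repetitions are all hypothetical choices, not from the slides.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50              # sample size (hypothetical)
num_samples = 5000  # number of repeated samples (hypothetical)

# Exponential distribution with rate 1: clearly not bell shaped,
# with mean m = 1 and standard deviation s = 1.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = 1.0 / math.sqrt(n)  # s / sqrt(n) from the CLT

print(round(mean_of_means, 3), round(sd_of_means, 3), round(predicted_sd, 3))
```

Even though the original variable is skewed, the sample means cluster around m with a spread close to s divided by the square root of n.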

Example: sampling distribution

• We are interested in measuring the number of hours people spend on social media per week

• Suppose the true population mean is 18.5 hours per week and the standard deviation is 5.8 hours

• If we draw a sample of 100 people, the mean social media hours follows the distribution (standard deviation 5.8/√100 = 0.58):

$$\bar{x} \sim \text{Normal}(18.5, 0.58)$$

• If we draw a sample of 250 people, the mean social media hours follows the distribution (standard deviation 5.8/√250 ≈ 0.37):

$$\bar{x} \sim \text{Normal}(18.5, 0.37)$$
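The two standard deviations in this example come from dividing 5.8 by the square root of the sample size. A quick Python check, with an illustrative helper name:

```python
import math

pop_mean, pop_sd = 18.5, 5.8  # population values from the example

def standard_error(sd, n):
    """Standard deviation of the sample mean for sample size n."""
    return sd / math.sqrt(n)

se_100 = standard_error(pop_sd, 100)
se_250 = standard_error(pop_sd, 250)

print(round(se_100, 2), round(se_250, 2))
```

A larger sample gives a tighter sampling distribution, but the gain is slow: 2.5 times as many people shrinks the spread by only about a third.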

Sampling methods

Simple random sampling

Stratified sampling

Systematic sampling

Cluster sampling

Judgment sampling


Simple Random Sampling

• Every unit in the population has equal probability of being selected

• For example, if the population size is 10,000 and the sample size is 200, the probability of selection is 2 percent

• The purest form of probability sampling

• Disadvantages
  • Subject to random sampling error
  • May be costly to implement
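A minimal sketch of simple random sampling for the 10,000 and 200 example. The unit IDs are hypothetical stand-ins for a real sampling frame.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

population = range(1, 10_001)  # hypothetical unit IDs 1..10,000
sample_size = 200

# Draw without replacement; every unit is equally likely to be chosen.
sample = random.sample(population, sample_size)
selection_probability = sample_size / len(population)

print(len(sample), selection_probability)
```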


Stratified Sampling

• The population is divided into mutually exclusive subsets according to some important characteristics

• For example, gender is important for the question of interest; women make up 60% of the population and men make up 40%

• A stratified sample reflects the 60%-40% split of women and men

• Less sampling error than simple random sampling



Example: stratified sampling

[Figure: sample size = 200, split into 120 = 200 x 60% and 80 = 200 x 40%]



• Suppose that gender is important for the question of interest

• We know that women make up 60% of the population and men make up 40%

• If the total sample size is 200, a stratified sample would consist of 120 women and 80 men
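The 120 and 80 split can be sketched as proportional allocation followed by a simple random sample within each stratum. The population of 10,000 people below is hypothetical; only the 60%-40% split and the sample size of 200 come from the slides.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical population: 6,000 women and 4,000 men (60% / 40%).
population = [("woman", i) for i in range(6000)] + [("man", i) for i in range(4000)]
sample_size = 200

# Group the population into strata by gender.
strata = {}
for person in population:
    strata.setdefault(person[0], []).append(person)

# Sample each stratum in proportion to its share of the population.
sample = []
for label, members in strata.items():
    k = round(sample_size * len(members) / len(population))
    sample.extend(random.sample(members, k))

women = sum(1 for gender, _ in sample if gender == "woman")
men = sum(1 for gender, _ in sample if gender == "man")
print(women, men)
```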

Systematic Sampling

• For example, select every 10th person entering the mall to interview
• Or, check a stock price on the first day of each month for two years
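A minimal sketch of picking every 10th person. The list of 100 visitors is hypothetical, and the random starting point is a common convention assumed here, not something the slides specify.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

visitors = list(range(1, 101))  # hypothetical: 100 people in arrival order
step = 10

start = random.randrange(step)  # assumed: random start within the first 10
sample = visitors[start::step]  # then every 10th person after that

print(sample)
```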

Cluster Sampling

• The units should form clusters, for example, students from the same school, customers shopping at the same store
• The sampling selection is done at the cluster level following a two-step process
  step 1: randomly select the clusters (schools, stores, etc.)
  step 2: choose all the units from the selected clusters
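The two-step process can be sketched as follows. The 8 schools with 5 students each, and the choice of 3 clusters, are hypothetical numbers used only for illustration.

```python
import random

random.seed(4)  # fixed seed so the run is repeatable

# Hypothetical clusters: 8 schools with 5 students each.
schools = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(5)]
    for s in range(8)
}

# Step 1: randomly select the clusters.
chosen = random.sample(sorted(schools), 3)

# Step 2: take all the units from the selected clusters.
sample = [student for name in chosen for student in schools[name]]

print(chosen, len(sample))
```

Randomness enters only at the cluster level; within a chosen school, every student is included.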


