week 1: probability and sampling - university of …...my background • dean’s chair in marketing...
TRANSCRIPT
-
Statistics Bootcamp
Professor P. K. Kannan
1
-
Agenda
1. Who am I?
2. Overview
3. Session 1: Probability and sampling
4. Session 2 – 1: Distributions and Hypothesis Testing
5. Session 2 – 2: Regression
2
-
Part I: Who am I?
3
-
My Background
• Dean’s Chair in Marketing Science, Professor of Marketing
• Ph.D. in Management Science, OR and Statistics
• Research in marketing analytics
• Have taught variations of this course in MBA programs/Masters in Marketing Analytics
• Teach Customer Equity Management, Digital Marketing in MBA program and in online mode
4
-
Part II: Bootcamp Overview
5
-
What we do…
• This bootcamp will emphasize concepts
• Visualize and summarize data
• Develop data intuition
• Generate insights
• The best way to learn is to practice!
• I have a couple of problem sets you can use to practice
• I am using PPT slides from a standard business statistics textbook
6
-
Part III: Probabilities and Sampling
7
• Random variables: definition and examples
• Summarizing variables: mean and standard deviation
• Normal distribution and calculating probabilities
• Central limit theorem (CLT) and its applications
• Sampling methods
-
Random variables
• A random variable denotes the possible outcomes of a random phenomenon or event
• A random variable can be discrete OR continuous
• Discrete random variables
  • take a fixed set of possible outcomes
  • have a probability associated with each outcome
• Continuous variables
  • take any value within a range
8
-
Examples of random variables
Discrete Continuous
1 Customer gender
2 Customer age
3 Total annual income of a household
4 Monthly subscription fee of a fitness center
5 Veteran status
6 Residential state
9
-
Examples of random variables
Discrete Continuous
1 Customer gender X
2 Customer age X
3 Total annual income of a household X
4 Monthly subscription fee of a fitness center X
5 Veteran status X
6 Residential state X
10
You may discretize a continuous variable into a discrete one, for example, by creating a categorical variable “age group” from the continuous variable “age”
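As a rough illustration of discretizing, the sketch below maps a continuous age to an age-group label. The cut points, labels, and ages are made up for this example and do not come from the slides.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, chosen only for illustration.
AGE_CUTS = [18, 35, 50, 65]
AGE_LABELS = ["under 18", "18-34", "35-49", "50-64", "65+"]

def age_group(age):
    """Map a continuous age to a discrete age-group label."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

ages = [17.5, 18.0, 34.9, 42.3, 70.1]  # made-up ages
groups = [age_group(a) for a in ages]
print(groups)
```

The continuous variable can take any value in a range, while the derived variable takes one of only five possible outcomes.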
-
Summarizing variables: mean and standard deviation
• First, let us get familiar with what a variable looks like in Excel
• Suppose we are interested in knowing how many days per week American adults go to a gym to work out
• Let us denote the variable as “X”
• X would take the values of 0, 1, 2, 3, 4, 5, 6, and 7
• Now, suppose we surveyed a representative sample of adults and received data in an Excel spreadsheet (as shown on the next page)
11
-
What the data could look like
• If data are entered individual-by-individual:
• If data are entered in aggregates:
12
-
If data are entered individual-by-individual:
13
Mean calculation:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Or, the average() function in Excel
Variance calculation:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \text{ where } \bar{x} = \text{sample mean}$$
Standard deviation = square root of variance
Or, the stdev.s() function in Excel
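For readers who prefer code to a spreadsheet, here is a minimal sketch of the same calculations using Python's standard `statistics` module. The ten responses are invented for illustration and are not the course data.

```python
import statistics

# Made-up survey responses: gym days per week for ten adults (not the course data).
days = [0, 0, 1, 2, 2, 3, 3, 4, 5, 7]

mean = statistics.mean(days)          # plays the role of Excel's average()
variance = statistics.variance(days)  # sample variance, divides by n - 1
std_dev = statistics.stdev(days)      # plays the role of Excel's stdev.s()

# The same sample variance written out directly from the formula.
n = len(days)
manual_variance = sum((x - mean) ** 2 for x in days) / (n - 1)

print(mean, variance, std_dev)
```

Writing the variance out by hand next to the library call makes the n - 1 divisor visible.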
-
If data are entered in aggregates:
14
Mean calculation:
$$\bar{x} = \sum_{i=0}^{7} x_i \cdot p_i = 0 \cdot (0.293) + 1 \cdot (0.230) + \dots + 7 \cdot (0.047) = 2.3330$$
Variance calculation:
$$s^2 = \sum_{i=0}^{7} (x_i - \bar{x})^2 \cdot p_i = (0 - 2.333)^2 \cdot (0.293) + (1 - 2.333)^2 \cdot (0.230) + \dots + (7 - 2.333)^2 \cdot (0.047) = 5.4760$$
Standard deviation = square root of variance = 2.34
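The aggregate calculation can be sketched as a weighted sum. The slide shows only three of the eight proportions, so the proportions below are entirely made up and the results will not match the 2.333 and 5.476 on the slide. In Excel, a function such as SUMPRODUCT could play the role of the weighted sums.

```python
import math

values = [0, 1, 2, 3, 4, 5, 6, 7]
# Hypothetical proportions (must sum to 1); NOT the ones from the slide.
probs = [0.30, 0.20, 0.15, 0.12, 0.10, 0.07, 0.04, 0.02]

# Mean: each value weighted by its proportion.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: squared deviation from the mean, weighted by the proportion.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
std_dev = math.sqrt(variance)

print(round(mean, 4), round(variance, 4), round(std_dev, 4))
```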
-
Can you think of how to do the calculation in Excel?
15
-
Some additional stuff about random variables
• Suppose you take repeated independent draws from the distribution of a random variable X; assume X ~ (Mean 2, Std. Dev 3)
• Sum of 3 draws: X1+X2+X3 (Think of this as three different stocks you own in your portfolio, each with the same mean return of 2 and std. dev of 3. What is the mean total (sum) return on these stocks? It is the sum of the mean returns on each stock.)
• What is the mean (expected value) of this sum?
  • E(X1+X2+X3) = E(X1) + E(X2) + E(X3) = 3*E(X) = 3*2 = 6
(What is the standard deviation of the sum of the returns on these stocks? It is the square root of the variance of the sum.)
• What is the variance, V, of this sum?
  • V(X1+X2+X3) = V(X1) + V(X2) + V(X3) = 3*V(X) = 3*9 = 27
  • Standard deviation = square root of 27 ≈ 5.20
(We assume that the stocks’ returns are independent of each other to calculate the above variance and std. dev.)
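The arithmetic for the sum of three independent draws can be checked in a few lines. This is only a sketch of the rules stated above; the variable names are my own.

```python
import math

mean_x, sd_x, k = 2.0, 3.0, 3   # from the slide: mean 2, std. dev 3, three draws

mean_sum = k * mean_x           # expected values always add
var_sum = k * sd_x ** 2         # variances add only when the draws are independent
sd_sum = math.sqrt(var_sum)     # standard deviations do NOT add: 3 * 3 = 9 would be wrong

print(mean_sum, var_sum, round(sd_sum, 2))
```

Note that the standard deviation of the sum (about 5.2) is smaller than the sum of the three standard deviations (9), which is the diversification effect in the portfolio analogy.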
16
-
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 2-17
Frequency Distribution: Continuous Data
• Continuous Data: may take on any value in some interval
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27
(Temperature is a continuous variable because it could be measured to any degree of precision desired)
-
Grouping Data by Classes
Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 20)
• Compute class width: 10 (46/5 = 9.2, then round up to a convenient value)
• Determine class boundaries: 10, 20, 30, 40, 50
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations & assign to classes
-
Frequency Distribution Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Frequency Distribution
Class              Frequency   Relative Frequency
10 but under 20        3            .15
20 but under 30        6            .30
30 but under 40        5            .25
40 but under 50        4            .20
50 but under 60        2            .10
Total                 20           1.00
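The grouping steps can be reproduced with a short script. This is a sketch that uses the 20 temperatures from the example and the class width of 10 starting at 10; the variable names are illustrative.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

lower, width, n_classes = 10, 10, 5   # classes: 10-20, 20-30, ..., 50-60

counts = [0] * n_classes
for t in temps:
    # Integer division gives the index of the class containing t
    # ("10 but under 20" is index 0, and so on).
    counts[(t - lower) // width] += 1

rel_freq = [c / len(temps) for c in counts]

for i, (c, r) in enumerate(zip(counts, rel_freq)):
    lo = lower + i * width
    print(f"{lo} but under {lo + width}: {c:2d}  {r:.2f}")
```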
-
Histograms
• The classes or intervals are shown on the horizontal axis
• Frequency is measured on the vertical axis
• Bars of the appropriate heights can be used to represent the number of observations within each class
• Such a graph is called a histogram
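As a quick stand-in for a chart, the sketch below prints a sideways text histogram, one row per class, using the class midpoints and frequencies from the temperature example. A plotting library or Excel would normally be used instead.

```python
midpoints = [15, 25, 35, 45, 55]
freqs = [3, 6, 5, 4, 2]

# One row per class: the midpoint, then one '#' per observation in the class.
lines = [f"{m:>3} | {'#' * f} ({f})" for m, f in zip(midpoints, freqs)]
print("\n".join(lines))
```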
-
Histogram Example
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of the temperature data. Horizontal axis: Class Midpoints (15, 25, 35, 45, 55); vertical axis: Frequency (3, 6, 5, 4, 2).]
No gaps between bars, since continuous data
-
Normal distribution
• The normal distribution plays a central role in statistics
• A normal distribution is determined by two parameters: mean (denoted as m) and standard deviation (denoted as s)
• If a random variable X follows normal distribution with mean m and standard deviation s, we write it as X ~ Normal(m, s)
• The normal distribution has the familiar bell shape, symmetrical around the mean
• With large samples, the normal distribution can be used to approximate a large variety of distributions
22
-
Histogram (or density plot) for Normal(m,s)
23
The distribution is centered around the mean, which is denoted as m
The shape of the distribution is determined by the standard deviation, denoted as s
-
Examples of normal distributions
24
-
Probability is the area under the curve
25
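[Figure: normal curve with -a and a marked on the horizontal axis]
Prob(X < -a)
Prob(X > a) = 1 - Prob(X ≤ a)
It should be obvious to you that, for a normal distribution centered at 0,
$$P(X < 0) = P(X > 0) = 0.5, \qquad P(-\infty < X < \infty) = 1$$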
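These area rules can be checked numerically. The sketch below assumes a standard normal (mean 0, standard deviation 1) and an arbitrary cut-off a = 1.5; neither value comes from the slides. `NormalDist.cdf` gives the area to the left of a point, similar in spirit to a cumulative Norm.Dist call in Excel.

```python
from statistics import NormalDist

X = NormalDist(mu=0, sigma=1)   # assumed: standard normal, centered at 0
a = 1.5                         # arbitrary cut-off chosen for illustration

p_below = X.cdf(a)              # Prob(X <= a): area to the left of a
p_above = 1 - p_below           # Prob(X > a) = 1 - Prob(X <= a)
p_left_tail = X.cdf(-a)         # Prob(X < -a): equals Prob(X > a) by symmetry
p_between = X.cdf(a) - X.cdf(-a)  # area between -a and a

print(round(p_above, 4), round(p_left_tail, 4), round(p_between, 4))
```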
-
Illustration: calculating probability (1)
26
The area under the curve:Prob(X
-
Illustration: calculating probability (2)
27
The area under the curve:Prob(X>200)=1-Prob(X
-
Illustration: calculating probability (3)
28
The area under the curve:Prob(50
-
Illustration: calculating probability (4)
29
The area under the curve:Prob(X
-
A Little Problem with Norm.Inv
• Entry to a certain University is determined by a national test. The scores on this test are normally distributed with a mean of 500 and a standard deviation of 100. Tom wants to be admitted to this university and he knows that he must score better than at least 70% of the students who took the test. Tom takes the test and scores 585. Will he be admitted to this university?
• X ~ N(500, 100)
• Norm.Inv(0.70,500,100) = 552.44
• Tom gets 585, which is > 552.44, so he will be admitted
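The same check can be sketched in Python, where `NormalDist.inv_cdf` plays the role of Norm.Inv. The variable names are my own.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)   # test scores: mean 500, std. dev 100

cutoff = scores.inv_cdf(0.70)            # score that beats 70% of test takers
tom = 585
admitted = tom > cutoff

tom_percentile = scores.cdf(tom)         # share of test takers scoring below Tom

print(round(cutoff, 2), admitted, round(tom_percentile, 3))
```

The second calculation runs the problem in the other direction: instead of converting 70% into a score, it converts Tom's score into a percentile, which is a little above 80%.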
30
-
Population versus sample
• Population: all members of a defined group that we are interested in studying
• Census: data collected from all members of the population
• Sample: a part of the population
31
Population Mean: µ
Sample Mean: $\bar{x}$
We say $\bar{x}$ is an estimate for µ
-
Central Limit Theorem
• Suppose we would like to use a sample (with sample size n) to estimate the population mean
• Naturally, different samples (with the same sample size) will yield different means
• If we draw a large number of samples, their means should follow a distribution, called “the sampling distribution of the mean”
• CLT: for a sufficiently large sample size n, the means should roughly follow a normal distribution
If the original variable X has mean of m and standard deviation of s, the sampling distribution is:
$$\bar{x} \sim \text{Normal}\left(m, \frac{s}{\sqrt{n}}\right)$$
32
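A small simulation can make the theorem concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (clearly not normal), and the sample size and number of repetitions are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)             # fixed seed so the run is repeatable
n, n_samples = 50, 2000    # arbitrary sizes chosen for illustration

# Assumed population: exponential with mean 1 and std. dev 1 (skewed, non-normal).
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

observed_mean = statistics.fmean(sample_means)  # should be close to m = 1
observed_sd = statistics.stdev(sample_means)    # should be close to s / sqrt(n)
predicted_sd = 1.0 / math.sqrt(n)

print(round(observed_mean, 3), round(observed_sd, 3), round(predicted_sd, 3))
```

Even though individual draws are heavily skewed, the spread of the sample means lands near s divided by the square root of n.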
-
Example: sampling distribution
• We are interested in measuring the number of hours people spend on social media per week
• Suppose the true population mean is 18.5 hours per week and the standard deviation is 5.8 hours
• If we draw a sample of 100 people, the mean social media hours follows the distribution: $\bar{x} \sim \text{Normal}(18.5, 0.58)$, since $5.8/\sqrt{100} = 0.58$
• If we draw a sample of 250 people, the mean social media hours follows the distribution: $\bar{x} \sim \text{Normal}(18.5, 0.37)$, since $5.8/\sqrt{250} \approx 0.37$
33
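The two standard deviations in this example come from dividing the population standard deviation by the square root of the sample size. A minimal sketch, with a helper name of my own choosing:

```python
import math

mu, sigma = 18.5, 5.8   # population mean and std. dev from the example

def standard_error(sigma, n):
    """Std. dev of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)

se_100 = standard_error(sigma, 100)
se_250 = standard_error(sigma, 250)

print(round(se_100, 2), round(se_250, 2))
```

Going from 100 to 250 people shrinks the spread of the sample mean, but only by the square root of the increase in sample size.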
-
Sampling methods
Simple random sampling
Stratified sampling
Systematic sampling
Cluster sampling
Judgment sampling
34
-
Simple Random Sampling
• Every unit in the population has equal probability of being selected
• For example, if the population size is 10,000 and the sample size is 200, the probability of selection is 2 percent
• The purest form of probability sampling
• Disadvantages
  • Subject to random sampling error
  • May be costly to implement
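A simple random sample can be sketched with Python's `random.sample`, which draws without replacement. The population here is just a hypothetical list of unit IDs matching the sizes in the example.

```python
import random

random.seed(42)                          # fixed seed, only for repeatability

population = list(range(1, 10_001))      # hypothetical unit IDs 1..10,000
sample = random.sample(population, 200)  # every unit is equally likely to be chosen

selection_prob = len(sample) / len(population)
print(selection_prob)
```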
35
-
Stratified Sampling
• The population is divided into mutually exclusive subsets according to some important characteristics
• For example, gender is important for the question of interest; women make up 60% of the population and men make up 40%
• A stratified sample reflects the 60%-40% split of women and men
• Less sampling error than simple random sampling
36
-
Example: stratified sampling
37
[Figure: total sample size = 200, split as 120 = 200 x 60% and 80 = 200 x 40%, with simple random sampling within each stratum]
• Suppose that gender is important for the question of interest
• We know that women make up 60% of the population and men make up 40%
• If the total sample size is 200, a stratified sample would consist of 120 women and 80 men
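A proportional stratified sample along these lines might be sketched as follows. The population of 10,000 people is hypothetical; only the 60%/40% split and the sample size of 200 come from the example.

```python
import random

random.seed(1)   # fixed seed, only for repeatability

# Hypothetical population of 10,000 people: 60% women, 40% men.
population = [("woman", i) for i in range(6000)] + [("man", i) for i in range(4000)]
sample_size = 200

# Split the population into strata by gender.
strata = {}
for person in population:
    strata.setdefault(person[0], []).append(person)

sample = []
for members in strata.values():
    # Proportional allocation: each stratum's share of the sample
    # equals its share of the population.
    k = round(sample_size * len(members) / len(population))
    sample.extend(random.sample(members, k))   # simple random sample within the stratum

n_women = sum(1 for gender, _ in sample if gender == "woman")
n_men = len(sample) - n_women
print(n_women, n_men)
```

With shares that do not divide evenly, the rounded stratum sizes may not add up exactly to the target, so a real design would need a rule for the leftover units.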
-
Systematic Sampling
• For example, select every 10th person entering the mall to interview
• Or, check a stock price on the first day of each month for two years
Cluster Sampling
• The units should form clusters, for example, students from the same school, customers shopping at the same store
• The sampling selection is done at the cluster level following a two-step process
  step 1: randomly select the clusters (schools, stores, etc.)
  step 2: choose all the units from the selected clusters
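Both methods can be sketched in a few lines. The visitor stream, the schools, and their sizes below are hypothetical and serve only to show the mechanics.

```python
import random

random.seed(7)   # fixed seed, only for repeatability

# Systematic sampling: every 10th person from a hypothetical stream of 100 visitors.
visitors = list(range(1, 101))
systematic_sample = visitors[9::10]   # the 10th, 20th, ..., 100th visitor

# Cluster sampling: hypothetical schools, each one a cluster of 30 students.
schools = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(30)]
    for s in range(8)
}
chosen = random.sample(list(schools), 2)   # step 1: randomly select clusters
cluster_sample = [
    student for name in chosen for student in schools[name]
]                                          # step 2: take all units in those clusters

print(systematic_sample)
print(chosen, len(cluster_sample))
```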
38