basic statistics ii biostatistics, mha, cdc, jul 09 prof. kg satheesh kumar asian school of business

50
Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Post on 20-Dec-2015

221 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Basic Statistics II

Biostatistics, MHA, CDC, Jul 09

Prof. KG Satheesh Kumar

Asian School of Business

Page 2: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Frequency Distribution and Probability Distribution

• Frequency Distribution: Plot of frequency along y-axis and variable along the x-axis

• Histogram is an example

• Probability Distribution: Plot of probability along y-axis and variable along x-axis

• Both have same shape

• Properties of probability distributions• Probability is always between 0 and 1• Sum of probabilities must be 1

Page 3: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Theoretical Probability Distributions

• For a discrete variable we have discrete probability distribution

• Binomial Distribution• Poisson Distribution• Geometric Distribution• Hypergeometric Distribution

• For a continuous variable we have continuous probability distribution

• Uniform (rectangular) Distribution• Exponential Distribution• Normal Distribution

Page 4: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

The Normal Distribution

• If a random variable, X is affected by many independent causes, none of which is overwhelmingly large, the probability distribution of X closely follows normal distribution. Then X is called normal variate and we write X ~ N(, 2), where is the mean and 2 is the variance

• A Normal pdf is completely defined by its mean, and variance, 2. The square root of variance is called standard deviation .

• If several independent random variables are normally distributed, their sum will also be normally distributed with mean equal to the sum of individual means and variance equal to the sum of individual variances.

Page 5: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

The Normal pdf

Page 6: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

The area under any pdf between two given values of X is the probability that X falls

between these two values

Page 7: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Standard Normal Variate, Z

• SNV, Z is the normal random variable with mean 0 and standard deviation 1

• Tables are available for Standard Normal Probabilities

• X and Z are connected by:Z = (X - ) / and X = + Z

• The area under the X curve between X1 and X2 is equal to the area under Z curve between Z1 and Z2.

Page 8: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

• z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09• 0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359• 0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753• 0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141• 0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517• 0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879• 0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224• 0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549• 0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852• 0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133• 0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389• 1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621• 1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830• 1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015• 1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177• 1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319• 1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441• 1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545• 1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633• 1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706• 1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767• 2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817• 2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857• 2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890• 2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916• 2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936• 2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952• 2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964• 2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974• 2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981• 2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986• 3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990• 3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993• 3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995• 3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997• 3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998

Standard Normal Probabilities (Table of z distribution)

The z-value is on the left and top margins and the probability (shaded area in the diagram) is in the body of the table

Page 9: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Illustration

Q.A tube light has mean life of 4500 hours with a standard deviation of 1500 hours. In a lot of 1000 tubes estimate the number of tubes lasting between 4000 and 6000 hours

A. P(4000<X<6000) = P(-1/3<Z<1) = 0.1306 + 0.3413 = 0.4719

Hence the probable number of tubes in a lot of 1000 lasting 4000 to 6000 hours is 472

Page 10: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Illustration

Q. Cost of a certain procedure is estimated to average Rs.25,000 per patient. Assuming normal distribution and standard deviation of Rs.5000, find a value such that 95% of the patients pay less than that.

A. Using tables, P(Z<Z1) = 0.95 gives Z1 = 1.645. Hence X1 = 25000 + 1.645 x 5000 = Rs.33,225 95% of the patients pay less than Rs.33,225

Page 11: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Sampling Basics

• Population or Universe is the collection of all units of interest. E.g.: Households of a specific type in a given city at a certain time. Population may be finite or infinite

• Sampling Frame is the list of all the units in the population with identifications like Sl.Nos, house numbers, telephone nos etc

• Sample is a set of units drawn from the population according to some specified procedure

• Unit is an element or group of elements on which observations are made. E.g. a person, a family, a school, a book, a piece of furniture etc.

Page 12: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Census Vs Sampling

• Census– Thought to be accurate and reliable, but often not so if

the population is large– More resources (money, time, manpower)– Unsuitable for destructive tests

• Sampling– Less resources– Highly qualified and skilled persons can be used– Sampling error, which can be reduced using large and

representative sample

Page 13: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Sampling Methods

• Probability Sampling (Random Sampling)– Simple Random Sampling– Systematic Random Sampling– Stratified Random Sampling– Cluster Sampling (Single stage , Multi-stage)

• Non-probability Sampling– Convenience Sampling– Judgment Sampling– Quota Sampling

Page 14: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Limitations of Non-Random Sampling

• Selection does not ensure a known chance that a unit will be selected (i.e. non-representative)

• Inaccurate in view of the selection bias

• Results cannot be used for generalisation because inferential statistics requires probability sampling for valid conclusions

• Useful for pilot studies and exploratory research

Page 15: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Sampling Distribution and Standard Error of the Mean

• The sampling distribution of x is the probability distribution of all possible values of x for a given sample size n taken from the population.

• According to the Central Limit Theorem, for large enough sample size, n, the sampling distribution is approximately normal with mean and standard deviation /n. This standard deviation is called standard error of the mean.

• CLT holds for non-normal populations also and states: For large enough n, x ~ N(, 2/n)

Page 16: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Illustration

Q. When sampling from a population with SD 55, using a sample size of 150, what is the probability that the sample mean will be at least 8 units away from the population mean?

A. Standard Error of the mean, SE = 55/sqrt(150) = 4.4907Hence 8 units = 1.7815 SEArea within 1.7815 SE on both sides of the mean = 2 * 0.4625 = 0.925Hence required probability = 1-0.925 = 0.075

Page 17: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Illustration

Q. An Economist wishes to estimate the average family income in a certain population. The population SD is known to be $4,500 and the economist uses a random sample of size 225. What is the probability that the sample mean will fall within $800 of the population mean?

Page 18: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Point and Interval Estimation• The value of an estimator (see next slide), obtained from a sample

can be used to estimate the value of the population parameter. Such an estimate is called a point estimate.

• This is a 50:50 estimate, in the sense, the actual parameter value is equally likely to be on either side of the point estimate.

• A more useful estimate is the interval estimate, where an interval is specified along with a measure of confidence (90%, 95%, 99% etc)

• The interval estimate with its associated measure of confidence is called a confidence interval.

• A confidence interval is a range of numbers believed to include the unknown population parameter, with a certain level of confidence

Page 19: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Estimators

• Population parameters (, 2, p) and Sample Statistics (x,s2, ps)

• An estimator of a population parameter is a sample statistic used to estimate the parameter

• Statistic,x is an estimator of parameter • Statistic, s2 is an estimator of parameter 2

• Statistic, ps is an estimator of parameter p

Page 20: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Illustration

Q. A wine importer needs to report the average percentage of alcohol in bottles of French wine. From experience with previous kinds of wine, the importer believes the population SD is 1.2%. The importer randomly samples 60 bottles of the new wine and obtains a sample mean of 9.3%. Find the 90% confidence interval for the average percentage of alcohol in the population.

Page 21: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Answer

Standard Error = 1.2%/sqrt(60) = 0.1549%

For 90% confidence interval, Z = 1.645

Hence the margin of error = 1.645*0.1549%

= 0.2548%

Hence 90% confidence interval is

9.3% +/- 0.3%

Page 22: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

More Sampling Distributions

• Sampling Distribution is the probability distribution of a given test statistic (e.g. Z), which is a numerical quantity calculated from sample statistic

• Sampling distribution depends on the distribution of the population, the statistic being considered and the sample size

• Distribution of Sample Mean: Z or t distribution

• Distribution of Sample Proportion: Z (large sample)

• Distribution of Sample Variance: Chi-square distribution

Page 23: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

The t-distribution

• The t-distribution is also bell-shaped and very similar to the Z(0,1) distribution

• Its mean is 0 and variance is df/(df-2)

• df = degrees of freedom = n-1 & n = sample size

• For large sample size, t &Z are identical

• For small n, the variance of t is larger than that of Z and hence wider tails, indicating the uncertainty introduced by unknown population SD or smaller sample size n

Page 24: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business
Page 25: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Illustration

Q. A large drugstore wants to estimate the average weekly sales for a brand of soap. A random sample of 13 weeks gives the following numbers: 123, 110, 95, 120, 87, 89, 100, 105, 98, 88, 75, 125, 101. Determine the 90% confidence interval for average weekly sales.

A. Sample mean = 101.23 and Sample SD = 15.13. From t-table, for 90% confidence at df = 12 is t = 1.782. Hence Margin of Error = 1.782 * 15.13/sqrt(13) = 7.48. The 90% confidence interval is (93.75,108.71)

Page 26: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Chi-Square Distribution

• Chi-square distribution is the probability distribution of the sum of several independent squared Z variables

• It has a df parameter associated with it (like t distribution).

• Being a sum of squares, the chi-squares cannot be negative and hence the distribution curve is entirely on the positive side, skewed to the right.

The mean is df and variance is 2df

Page 27: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Confidence Interval for population variance using chi-square distribution

Q. A random sample of 30 gives a sample variance of 18,540 for a certain variable. Give a 95% confidence interval for the population variance

A. Point estimate for population variance = 18,540Given df = 29, excel gives chi-square values: For 2.5%, 45.7 and for 97.5, 16.0Hence for the population variance, the lower limit of the confidence interval = 18540 *29/45.7 = 11,765 andthe upper limit of the confidence interval = 18540*29/16.0 = 33,604

Page 28: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Chi-Square Distribution

• Chi-square distribution is the probability distribution of the sum of several independent squared Z variables

• It has a df parameter associated with it (like t distribution).

• Being a sum of squares, the chi-squares cannot be negative and hence the distribution curve is entirely on the positive side, skewed to the right.

The mean is df and variance is 2df

Page 29: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Chi-Square Test for Goodness of Fit

• A goodness-of-fit is a statistical test of how sample data support an assumption about the distribution of a population

• Chi-square statistic used isΧ2 = ∑(O-E)2/E, where O is the observed value and E the expected valueThe above value is then compared with the critical value (obtained from table or using excel) for the given df and the required level of significance, α (1% or 5%)

Page 30: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Illustration

Q. A company comes out with a new watch and wants to find out whether people have special preferences for colour or whether all four colours under consideration are equally preferred. A random sample of 80 prospective buyers indicated preferences as follows: 12, 40, 8, 20. Is there a colour preference at 1% significance?

A. Assuming no preference, the expected values would all be 20. Hence the chi-square value is 64/20 + 400/20 + 144/20 + 0 = 30.4For df = 3 and 1% significance, the right tail area is 11.3.

The computed value of 30.4 is far greater than 11.3 and hence deeply in the rejection region. So we reject the assumption of no colour preference.

Page 31: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Q. Following data is about the births of new born babies on various days of the week during the past one year in a hospital. Can we assume that birth is independent of the day of the week? Sun:116, Mon:184, Tue: 148, Wed: 145, Thu: 153, Fri: 150, Sat: 154 (Total: 1050)

Ans: Assuming independence, the expected values would all be 1050/7 = 150. Hence the chi-square value is 342/150+342/150+22/150+52/150+32/150+42/150=2366/150 = 15.77For df = 6 and 5% significance, the right tail area is 12.6.

The computed value of 15.77 is greater than the critical value of 12.6 and hence falls in the rejection region. So we reject the assumption of independence.

Page 32: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Correlation

• Correlation refers to the concomitant variation between two variables in such a way that change in one is associated with a change in the other

• The statistical technique used to analyse the strength and direction of the above association between two variables is called correlation analysis

Page 33: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Correlation and Causation

• Even if an association is established between two variables no cause-effect relationship is implied

• Association between x and y may be looked upon as:– x causes y– y causes x– x and y influence each other (mutual influence)– x and y are both influenced by z, v (influence of third variable)– due to chance (spurious association)

• Hence caution needed while interpreting correlation

Page 34: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Types of Correlations

• Positive (direct) and negative (inverse)– Positive: direction of change is the same– Negative: direction of change is opposite

• Linear and non-linear– Linear: changes are in a constant ratio– Non-linear: ratio of change is varying

• Simple, Partial and Multiple– Simple: Only two variables are involved– Partial: There may be third and other variables, but they are kept

constant– Multiple: Association of multiple variables considered

simultaneously

Page 35: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Scatter Diagrams

Correlation coefficient r = 1 r = - 0.54 r = 0.85

r= - 0.94 r=0.42 r=0.17

Page 36: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Correlation Coefficient

• Correlation coefficient (r) indicates the strength and direction of association

• The value of r is between -1 and +1• -1: perfect negative correlation• +1: perfect positive correlation• Above 0.75: Very high correlation• 0.50 to 0.75: High correlation• 0.25 to 0.50: Low correlation• Below 0.25: Very low correlation

Page 37: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Methods of Correlation Analysis

• Scatter Diagram• A quick approximate visual idea of association

• Karl Pearson’s Coefficient of Correlation• For numeric data measured on interval or ratio scale• r = Cov(x,y) /(SDx SDy)

• Spearman’s Rank Correlation• For ordinal (rank) data• R = 1 – 6 * Sum of Squared Difference of Ranks / [n(n2-1)]

• Method of Least Squares• r2 = bxy * byx, i.e. product of regression coefficients

Page 38: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Karl Pearson Correlation Coefficient (Product-Moment Correlation)

• r = Covariance (x,y) / (SD of x * SD of y)

• Recall: n Var(X) = SSxx, nVar(Y) = SSYY and n Cov(X,Y) = SSXY

• Thus r2 = Cov2(X,Y)/[Var(X) Var(Y)]

= SS2XY / (SSxx * SSYY)

Note: r2 is called coefficient of determination

Page 39: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Sample Problem

The following data refers to two variables, promotional expense (Rs. Lakhs) and sales (‘000 units) collected in the context of a promotional study. Calculate the correlation coefficient

Promo 7 10 9 4 11 5 3

Sales 12 14 13 5 15 7 4

Page 40: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Promo (X) Sales (Y)

X - Ave(X

)Y - Ave

(Y) Sxy Sxx Syy

             

7 12 0 2 0 0 4

10 14 3 4 12 9 16

9 13 2 3 6 4 9

4 5 -3 -5 15 9 25

11 15 4 5 20 16 25

5 7 -2 -3 6 4 9

3 4 -4 -6 24 16 36

             

7 10     83 58 124

Ave(X) Ave(Y)     SSxy SSxx SSyy

Coefficient of Determination, r-squared = 83*83 / (58*124) = 0.95787

Coefficient of Correlation, r = square root of 0.95787 = 0.978708

Page 41: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Spearman’s Rank Correlation Coefficient

• The ranks of 15 students in two subjects A and B are given below. Find Spearman’s Rank Correlation Coefficient(1,10); (2,7); (3,2); (4,6); (5,4); (6,8); (7,3); (8,1); (9,11); (10,15); (11,9); (12,5); (13,14); (14,12) and (15,13)

Solution: SSD of Ranks = 81+25+1+4+1+4+ 16+49+4+25+4+49+1+4+4 = 272

R = 1 – 6*272/(14*15*16) = 0.5143

Hence moderate degree of positive correlation between the ranks of students in the two subjects

Page 42: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Regression Analysis

• Statistical technique for expressing the relationship between two (or more) variables in the form of an equation (regression equation)

• Dependent or response or predicted variable

• Independent or regressor or predictor variable

• Used for prediction or forecasting

Page 43: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Types of Regression Models

• Simple and Multiple Regression Models– Simple: Only one independent variable– Multiple: More than one independent variable

• Linear and Nonlinear Regression Models– Linear: Value of response variable changes in

proportion to the change in predictor so that Y = a+bX

Page 44: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Simple Linear Regression Model

Y = a + bX,

a and b are constants to be determined using the given data

Note: More strictly, we may say: Y = ayx + byxX

To determine a and b solve the following two equations (called “normal equations”):

∑Y = a n + b ∑x ------- (1)

∑YX = a ∑x + b ∑x2 ------- (2)

Page 45: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Calculating Regression Coeff

• Instead of solving the simultaneous equations one may directly use formulae

• For Y = a + bX, i.e. regression of Y on X• byx = SSxy / SSxx

• ayx = Y – byxX where mean values of Y, X are used

• For X = a + bY form (regression of X on Y)• bxy = SSxy / SSyy

• axy = Y – bxyX where mean values of Y, X are used

Page 46: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Example

For the earlier problem of Sales (dependent variable) Vs Promotional expenses (independent variable) set up the simple linear regression model and predict the sales when promotional spending is Rs.13 lakhs

Solution: We need to find a and b in Y = a + bXb = SSxy / SSxx = 83/58 = 1.4310a = Y - bX, at mean = 10 – 1.4310*7 = -0.017

Hence regression equation is Y = -0.017+1.4310XFor X = 13 Lakhs, we get Y = 18.59, i.e. 18,590 units of

predicted sales

Page 47: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Linear Regression using Excel

y = 1.431x - 0.0172

R2 = 0.9579

0

2

4

6

8

10

12

14

16

18

0 2 4 6 8 10 12

Page 48: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Properties of Regression Coeff

• Coefficient of determination r2 = byx * bxy

• If one regression coefficient is greater than one the other is less than one because r2 lies between 0 and 1

• Both regression coeff must have the same sign, which is also the sign of the correlation coeff r

• The regression lines intersect at the means of X and Y

• Each regression coefficient gives the slope of the respective regression line

Page 49: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Coefficient of Determination• Recall:

• SSyy = Sum of squared deviations of Y from the mean

• Let us define

• SSR as sum of squared deviations of estimated (using regression equation) values of Y from the mean

• SSE as the sum of squared deviations of errors (error means actual Y ~ estimated Y)

• It can be shown that:• SSyy = SSR + SSE, i.e. Total Variation = Explained Variation + Unexplained (error)

Variation • r2 = SSR/SSyy = Explained Variation / Total Variation

• Thus r2 represents the proportion of the total variability of the dependent variable y that is accounted for or explained by the independent variable x

Page 50: Basic Statistics II Biostatistics, MHA, CDC, Jul 09 Prof. KG Satheesh Kumar Asian School of Business

Coefficient of Determination for Statistical Validity of Promo-Sales Regression Model

Promo (X) Sales (Y) Ye = -0.017+1.4310X

Squared deviation of Ye from Mean

Squared deviation of Ye from Y

Squared Deviation of Y from Mean

           

7 12 10.00 0.00 4.00 4

10 14 14.29 18.43 0.09 16

9 13 12.86 8.19 0.02 9

4 5 5.71 18.43 0.50 25

11 15 15.72 32.76 0.52 25

5 7 7.14 8.19 0.02 9

3 4 4.28 32.76 0.08 36

           

7 10   118.77 5.22 124

      SSR SSE Ssyy

Coefficient of determination, r-squared = 118.77/124 = 0.957824

Thus 96% of the variation in sales is explained by promo expenses