Mogae Media Prediction using Logistic Regression

Upload: sourav-mahajan

Post on 13-Apr-2017


Page 1: Logistic Regression -

Mogae Media

Prediction using Logistic Regression

Page 2: Logistic Regression -

Need for logistic regression

Regression allows us to predict an output based on some inputs. For instance, we can predict someone's height from their mother's height and father's height. Ordinary linear regression suits this case because the outcome variable is a continuous real number.

But what if we wanted to predict something that's not a continuous number?

Let's say we want to predict whether it will rain tomorrow. Ordinary linear regression won't work here because it doesn't make sense to treat the outcome as a continuous number: it either will rain or it won't. In this case we use logistic regression, because the outcome variable is one of several categories.

Page 3: Logistic Regression -

                       Logistic Regression                  Regression
Independent Variable   Quantitative, Qualitative            Quantitative, Qualitative
Dependent Variable     Qualitative                          Quantitative
Example                Result (Pass, Fail) is the           Marks obtained is the
                       function of time given to study      function of time given to study

Page 4: Logistic Regression -

[Figure: two panels, titled "Regression" and "Logistic Regression". Labels: Marks, Study Hours, Passing Marks, Result (Pass / Fail).]

Page 5: Logistic Regression -

Binary logistic regression expression

Y = Dependent variable (binary)
β0 = Constant
β1 = Coefficient of variable X1
X1 = Independent variable
E = Error term
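The expression itself did not survive in this transcript, only its legend. As a rough illustration of what the symbols usually mean, the sketch below assumes the standard single-predictor form, in which the log-odds of Y = 1 equal β0 + β1·X1 and the probability is obtained by passing that value through the logistic function. The coefficient values are hypothetical and tied to the study-hours example, and the error term E from the legend is left out because the sketch computes an expected probability rather than simulating outcomes. Python is used here for convenience; the slides themselves use R.

```python
import math


def log_odds(x1, beta0, beta1):
    """Linear predictor: the log-odds of Y = 1 for a single predictor X1."""
    return beta0 + beta1 * x1


def probability(x1, beta0, beta1):
    """Logistic transform of the linear predictor, giving P(Y = 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(x1, beta0, beta1)))


# Hypothetical coefficients, chosen only for illustration: the chance of
# passing rises with study hours and crosses 50% at 4 hours.
BETA0, BETA1 = -4.0, 1.0

for hours in (1, 4, 8):
    print(hours, round(probability(hours, BETA0, BETA1), 3))
```

Whatever the input, the output stays strictly between 0 and 1, which is what makes this form suitable for a Pass/Fail outcome.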

Page 6: Logistic Regression -

Problem statement & Methodology

The purpose of the campaign is to get 25K customers registered for CRBT. The task can be accomplished by identifying the customers (or prospects) who are most likely to respond out of a total base of around 100 million users.

We have sample data for both respondents and non-respondents to the campaign, and we used logistic regression, which lets us predict a discrete outcome, such as whether a customer responds, from a set of variables that may be continuous, discrete, or a mix of these. Generally, the dependent or response variable is dichotomous, such as success/failure.

Sample data of Respondents: 13,600 unique subscribers

Sample data of Non Respondents: 14,000 unique subscribers

Page 7: Logistic Regression -

Hypothesis tests
• Is an individual predictor variable significant?
• Is the overall model significant?
• Is Model A significantly better than Model B?

Dataset used in model:

Outcome variable:
• 1: Responded
• 0: Not Responded

Predictors:
• Average monthly spend
• Operating system
• 2G data usage (MB)
• 3G data usage (MB)
• Gender
• Incoming messages
• Handset type
• Age on handset

Page 8: Logistic Regression -

Variable List

Circle              ARPU               Handset Maker      Gender          Rural_Urban
KARNATAKA  7741     Min.       0.0     NOKIA     5188     F        398    Rural  9864
BIHAR      7352     1st Qu.   43.0     SAMSUNG   4959     M       1474    URBAN 11256
ORISSA     1714     Median   106.7     MICROMAX  2375     Not A  19248
TAMILNADU   841     Mean     167.1     KARBONN   1373
MP          778     3rd Qu.  218.5     LAVA       848
KOLKATA     750     Max.    1923.0     (null)     819
(Other)    1944                        (Other)   5558

Handset Type2       Age on Handset            Data_2G               Data_3G              Data_usage
High end  4878      1-6 Months    17912.0     Min.       0          Min.       0         High data   4820
low end  16242      12-18 Months    359.0     1st Qu.    0.006      1st Qu.    0         low data    1783
                    6-12 Months    2568.0     Median     0.22       Median     0         Mid user    2515
                    Gt 18Months     281.0     Mean     101.104      Mean      45.94      Non data   12002
                                              3rd Qu.   41.17       3rd Qu.    0
                                              Max.    5587.644      Max.    4818.84

SMS_Out             Incoming_SMS         Handset type
Min.       0        Min.       0.0       BRANDED FEATURE          8738
1st Qu.    0        1st Qu.   13.0       CONNECTED, DATA DEVICES   495
Median     0        Median    28.0       LOCAL FEATURE            5438
Mean      20.84     Mean      56.5       OTHER                     912
3rd Qu.    2        3rd Qu.   55.0       SMART                    5422
Max.    1388        Max.    2708.0       TABLETS                   115

Page 9: Logistic Regression -

Logistic Function using R

Call:
glm(formula = Response ~ ARPU + FINAL_OS + DATA_USAGE_MB + GENDER +
    SMS_COUNT, family = "binomial", data = logistic)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8477  -0.3105  -0.2631  -0.2389   2.7828

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)           -2.0846569  0.1536134 -13.571  < 2e-16 ***
ARPU                   0.0022055  0.0001133  19.467  < 2e-16 ***
FINAL_OSlow end        0.1433112  0.0803809   1.783   0.0746 .
DATA_USAGE_MBlow data  0.5063541  0.1146128   4.418 9.96e-06 ***
DATA_USAGE_MBMid user  0.4145942  0.1058467   3.917 8.97e-05 ***
DATA_USAGE_MBNon data  0.0646602  0.0862048   0.750   0.4532
GENDERM               -0.1887909  0.1468794  -1.285   0.1987
GENDERNot A           -1.7300622  0.1342508 -12.887  < 2e-16 ***
SMS_COUNT             -0.0005477  0.0003265  -1.677   0.0934 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8987.3  on 21119  degrees of freedom
Residual deviance: 8207.8  on 21111  degrees of freedom
AIC: 8225.8

Number of Fisher Scoring iterations: 6
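The hypothesis-test questions from Page 7 can be answered directly from the numbers in this summary. The sketch below is an illustration in Python (the slides use R), not the authors' own procedure. It computes a likelihood-ratio test of the overall model from the drop in deviance, re-derives the Wald z value for ARPU as estimate divided by standard error, and checks that the reported AIC equals the residual deviance plus twice the number of coefficients. The helper `chi2_sf_even_df` is a name introduced here; it uses the closed-form chi-square tail that holds for even degrees of freedom, so no statistics library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; exact for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^i / i! for i = 0 .. df/2 - 1, built up term by term.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Figures copied from the glm summary.
null_deviance, null_df = 8987.3, 21119
resid_deviance, resid_df = 8207.8, 21111

# Overall model: drop in deviance relative to the intercept-only model.
lr_stat = null_deviance - resid_deviance
lr_df = null_df - resid_df
p_value = chi2_sf_even_df(lr_stat, lr_df)

# Individual predictor: Wald z = estimate / standard error (ARPU row).
z_arpu = 0.0022055 / 0.0001133

# AIC = residual deviance + 2 * number of estimated coefficients (9 here).
aic = resid_deviance + 2 * 9

print(f"LR statistic {lr_stat:.1f} on {lr_df} df, p = {p_value:.3g}")
print(f"Wald z for ARPU = {z_arpu:.2f}")
print(f"AIC = {aic:.1f}")
```

The same deviance-difference test compares two nested models (Model A against Model B): subtract their residual deviances and use the difference in their degrees of freedom. For models that are not nested, the lower AIC is the usual guide.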

Page 10: Logistic Regression -

Logistic Regression Interpretation

Predictors              Estimates    Exp(estimate)   Z-value   P-value       Significance
ARPU                      0.21791    1.243            18.493   0.000000002   Highly Significant
Data usage low            0.58973    1.8035            5.101   0.000000337   Highly Significant
Data usage mid            0.51124    1.6673            4.783   0.000001730   Highly Significant
No data usage             0.16117    1.1748            1.957   0.0504        Insignificant
Gender: Male             -0.14769    0.8626           -0.958   0.3378        Insignificant
Gender: Unclassified     -1.71462    0.180032        -12.180   0.00000002    Highly Significant
AON (6-12 months)         1.56461    4.78078          22.936   0.00000002    Highly Significant
AON (12-18 months)      -14.43952    0.000000535      -0.072   0.9425        Insignificant
AON (> 18 months)       -14.37723    0.000000570      -0.063   0.9496        Insignificant
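The Exp(estimate) column is the exponential of each estimate, which turns a log-odds coefficient into an odds ratio: a value above 1 means the odds of responding go up relative to the reference level (or per unit of the predictor), and a value below 1 means they go down. The short Python sketch below recomputes that column from the estimates listed in the table; the labels are shorthand introduced here.

```python
import math

# (predictor, estimate) pairs copied from the interpretation table.
estimates = [
    ("ARPU", 0.21791),
    ("Data usage low", 0.58973),
    ("Data usage mid", 0.51124),
    ("No data usage", 0.16117),
    ("Gender: Male", -0.14769),
    ("Gender: Unclassified", -1.71462),
    ("AON 6-12 months", 1.56461),
    ("AON 12-18 months", -14.43952),
    ("AON > 18 months", -14.37723),
]

# Odds ratio = exp(estimate).
odds_ratios = {name: math.exp(beta) for name, beta in estimates}

for name, beta in estimates:
    print(f"{name:<22} {beta:>10.5f} {odds_ratios[name]:>14.6g}")
```

Read this way, an ARPU odds ratio of about 1.24 means each one-unit increase in ARPU, on whatever scale this model used, multiplies the odds of responding by roughly 1.24, holding the other predictors fixed.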

Page 11: Logistic Regression -

Predicted probability using our model

Cases   ARPU   Data usage   Gender   Age on Handset   Prob(Response)
1       100    Low          M        6-12 Months      45.85%
2       300    Low          M        6-12 Months      56.89%
3       700    Low          M        6-12 Months      75.78%
4       400    Mid          F        1-6 Months       26.73%
5       300    Low          F        6-12 Months      60.2%
6       400    Low          NA       12-18 Months     0.0003%
7       700    Low          F        6-12 Months      78.39%
8       700    Low          M        1-6 Months       39.56%
9       500    Low          F        6-12 Months      70.12%
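The probabilities in this table come from the model on Page 10, whose intercept is not shown in the transcript, so they cannot be reproduced here. To illustrate the mechanics only, the Python sketch below scores a subscriber with the coefficients from the glm summary on Page 9 instead: it adds up the intercept and the applicable terms, then applies the logistic function. Two things are assumptions rather than facts from the slides: the function name `predict_response`, and the reference levels (High end, High data, F), which are inferred from the levels that have no coefficient in the output.

```python
import math

# Coefficients copied from the glm summary on Page 9.
COEF = {
    "(Intercept)": -2.0846569,
    "ARPU": 0.0022055,
    "FINAL_OSlow end": 0.1433112,
    "DATA_USAGE_MBlow data": 0.5063541,
    "DATA_USAGE_MBMid user": 0.4145942,
    "DATA_USAGE_MBNon data": 0.0646602,
    "GENDERM": -0.1887909,
    "GENDERNot A": -1.7300622,
    "SMS_COUNT": -0.0005477,
}


def predict_response(arpu, final_os, data_usage, gender, sms_count):
    """Return P(Response = 1) for one subscriber.

    Categorical levels without a coefficient contribute 0, i.e. they are
    treated as reference levels (an assumption; note that a misspelled
    level would silently be treated the same way).
    """
    eta = COEF["(Intercept)"]
    eta += COEF["ARPU"] * arpu
    eta += COEF["SMS_COUNT"] * sms_count
    eta += COEF.get(f"FINAL_OS{final_os}", 0.0)
    eta += COEF.get(f"DATA_USAGE_MB{data_usage}", 0.0)
    eta += COEF.get(f"GENDER{gender}", 0.0)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical subscriber, for illustration only.
p = predict_response(arpu=300, final_os="High end",
                     data_usage="low data", gender="F", sms_count=0)
print(f"Predicted response probability: {p:.2%}")
```

Scoring every subscriber in the base this way and sorting by predicted probability gives a ranked prospect list from which the most likely respondents can be targeted.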

Page 12: Logistic Regression -

THANK YOU