Logistic Regression
TRANSCRIPT
Mogae Media
Prediction using Logistic Regression
Need for logistic regression
Regression allows us to predict an output based on some inputs. For instance, we can predict someone's height based on their mother's height and father's height. This type of regression is called linear regression because the outcome variable is a continuous real number.
But what if we wanted to predict something that's not a continuous number?
Let's say we want to predict whether it will rain tomorrow. Ordinary linear regression won't work in this case because it doesn't make sense to treat the outcome as a continuous number: it either will rain, or it won't. In this case we use logistic regression, because the outcome variable is one of several categories.
Logistic Regression vs. Regression

                        Logistic Regression                 Regression
Independent Variable    Quantitative, Qualitative           Quantitative, Qualitative
Dependent Variable      Qualitative                         Quantitative
Example                 Result (Pass, Fail) is a function   Marks obtained is a function
                        of time given to study              of time given to study
[Charts: linear regression of Marks vs. Study Hours with a Passing Marks threshold; logistic regression of Result (Pass/Fail) vs. Study Hours]
Logistic Regression
Binary logistic regression expression

logit(p) = ln(p / (1 - p)) = β0 + β1X1 + E

where:
  p  = P(Y = 1), Y = Dependent Variable
  β0 = Constant
  β1 = Coefficient of variable X1
  X1 = Independent Variable
  E  = Error Term
Problem statement & Methodology
The purpose of the campaign is to get 25K customers registered for CRBT. The task can be accomplished by identifying the customers (or prospects) most likely to respond out of a total base of around 100 million users.
We have sample data for both respondents and non-respondents to the campaign, and we used logistic regression, which allows us to predict a discrete outcome, such as campaign response, from a set of variables that may be continuous, discrete, or a mix of these. Generally, the dependent (response) variable is dichotomous, such as success/failure.
Sample data of Respondents: 13,600 unique subscribers
Sample data of Non-Respondents: 14,000 unique subscribers
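The campaign dataset itself is not reproduced here, so the sketch below shows the idea on synthetic response/non-response data: a one-predictor logistic regression fitted by gradient ascent on the Bernoulli log-likelihood. The deck's actual model is fitted with R's glm; every name and number below is illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic stand-in for the respondent / non-respondent sample:
# x is one scaled predictor (say, average monthly spend),
# y is 1 = responded, 0 = did not respond.
random.seed(42)
true_b0, true_b1 = -1.0, 2.0
data = []
for _ in range(500):
    x = random.uniform(-2.0, 2.0)
    y = 1 if random.random() < sigmoid(true_b0 + true_b1 * x) else 0
    data.append((x, y))

# Fit b0, b1 by gradient ascent on the log-likelihood
# (the gradient of the Bernoulli log-likelihood is sum of (y - p) * x).
b0, b1, lr = 0.0, 0.0, 0.5
for _ in range(800):
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in data) / len(data)
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in data) / len(data)
    b0 += lr * g0
    b1 += lr * g1
```

The recovered coefficients land near the true (-1.0, 2.0) used to simulate responses, which is the behaviour glm's iteratively reweighted least squares delivers on the real sample.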
Hypothesis tests
• Is an individual predictor variable significant?
• Is the overall model significant?
• Is Model A significantly better than Model B?
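The first question is usually answered with a Wald test: z = estimate / standard error, compared against the standard normal. As a worked example, using the ARPU coefficient from the R output later in the deck (estimate 0.0022055, std. error 0.0001133):

```python
import math

def wald_test(estimate, std_error):
    """Wald z statistic and two-sided p-value for H0: coefficient = 0."""
    z = estimate / std_error
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# ARPU coefficient from the fitted glm model.
z, p = wald_test(0.0022055, 0.0001133)
# z is about 19.47, matching the R summary; p is effectively zero (< 2e-16).
```

The second and third questions are answered by comparing deviances of nested models with a likelihood-ratio (chi-square) test.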
Dataset used in the model:

Outcome variable:
  1: Responded
  0: Not Responded

Predictors:
  Average monthly spend
  Operating system
  2G data usage (MB)
  3G data usage (MB)
  Gender
  Incoming messages
  Handset type
  Age on handset
Variable List

Circle             ARPU               Handset Maker    Gender        Rural_Urban
KARNATAKA   7741   Min.       0.0     NOKIA     5188   F       398   Rural  9864
BIHAR       7352   1st Qu.   43.0     SAMSUNG   4959   M      1474   URBAN 11256
ORISSA      1714   Median   106.7     MICROMAX  2375   Not A 19248
TAMILNADU    841   Mean     167.1     KARBONN   1373
MP           778   3rd Qu.  218.5     LAVA       848
KOLKATA      750   Max.    1923.0     (null)     819
(Other)     1944                      (Other)   5558

Handset Type2      Age on Handset        Data_2G            Data_3G          Data_usage
High end    4878   1-6 Months    17912   Min.       0       Min.      0      High data  4820
low end    16242   6-12 Months    2568   1st Qu.    0.006   1st Qu.   0      low data   1783
                   12-18 Months    359   Median     0.22    Median    0      Mid user   2515
                   Gt 18 Months    281   Mean     101.104   Mean     45.94   Non data  12002
                                         3rd Qu.   41.17    3rd Qu.   0
                                         Max.    5587.644   Max.   4818.84

SMS_Out            Incoming_SMS        Handset type
Min.       0       Min.       0.0      BRANDED FEATURE          8738
1st Qu.    0       1st Qu.   13.0      CONNECTED, DATA DEVICES   495
Median     0       Median    28.0      LOCAL FEATURE            5438
Mean      20.84    Mean      56.5      OTHER                     912
3rd Qu.    2       3rd Qu.   55.0      SMART                    5422
Max.    1388       Max.    2708.0      TABLETS                   115
Logistic function using R

Call:
glm(formula = Response ~ ARPU + FINAL_OS + DATA_USAGE_MB + GENDER +
    SMS_COUNT, family = "binomial", data = logistic)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8477  -0.3105  -0.2631  -0.2389   2.7828

Coefficients:
                        Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)           -2.0846569   0.1536134  -13.571   < 2e-16 ***
ARPU                   0.0022055   0.0001133   19.467   < 2e-16 ***
FINAL_OSlow end        0.1433112   0.0803809    1.783    0.0746 .
DATA_USAGE_MBlow data  0.5063541   0.1146128    4.418  9.96e-06 ***
DATA_USAGE_MBMid user  0.4145942   0.1058467    3.917  8.97e-05 ***
DATA_USAGE_MBNon data  0.0646602   0.0862048    0.750    0.4532
GENDERM               -0.1887909   0.1468794   -1.285    0.1987
GENDERNot A           -1.7300622   0.1342508  -12.887   < 2e-16 ***
SMS_COUNT             -0.0005477   0.0003265   -1.677    0.0934 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8987.3  on 21119  degrees of freedom
Residual deviance: 8207.8  on 21111  degrees of freedom
AIC: 8225.8

Number of Fisher Scoring iterations: 6
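The model's overall significance can be read off this output with a likelihood-ratio test: the drop from the null deviance (8987.3 on 21119 df) to the residual deviance (8207.8 on 21111 df) is chi-square distributed with 8 degrees of freedom under the null. A quick check in pure Python, using the Wilson-Hilferty normal approximation to the chi-square tail since scipy may not be available:

```python
import math

# Deviances and degrees of freedom from the glm summary.
null_deviance, null_df = 8987.3, 21119
resid_deviance, resid_df = 8207.8, 21111

stat = null_deviance - resid_deviance   # likelihood-ratio statistic (779.5)
df = null_df - resid_df                 # number of fitted coefficients (8)

# Wilson-Hilferty: (X/df)^(1/3) is approximately normal for a chi-square X.
z = ((stat / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
p = math.erfc(z / math.sqrt(2.0)) / 2.0  # upper-tail probability
```

With a statistic of 779.5 on 8 degrees of freedom the p-value is vanishingly small: the fitted model is overwhelmingly better than the intercept-only model.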
Logistic regression interpretation

Predictors             Estimate    Exp(estimate)  z-value  p-value       Significance
ARPU                    0.21791    1.243           18.493  0.000000002   Highly significant
Data usage: low         0.58973    1.8035           5.101  0.000000337   Highly significant
Data usage: mid         0.51124    1.6673           4.783  0.000001730   Highly significant
No data usage           0.16117    1.1748           1.957  0.0504        Insignificant
Gender: Male           -0.14769    0.8626          -0.958  0.3378        Insignificant
Gender: Unclassified   -1.71462    0.180032       -12.180  0.00000002    Highly significant
AON (6-12 months)       1.56461    4.78078         22.936  0.00000002    Highly significant
AON (12-18 months)    -14.43952    0.000000535     -0.072  0.9425        Insignificant
AON (>18 months)      -14.37723    0.000000570     -0.063  0.9496        Insignificant
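The Exp(estimate) column is the odds ratio: exponentiating a coefficient gives the multiplicative change in the odds of responding for that predictor. A quick check of a few of the table's values:

```python
import math

# Estimates from the interpretation table; exp(estimate) is the odds ratio.
estimates = {
    "ARPU": 0.21791,
    "Data usage: low": 0.58973,
    "Gender: Unclassified": -1.71462,
}
odds_ratios = {name: math.exp(b) for name, b in estimates.items()}
# e.g. exp(0.58973) is about 1.80: low-data users have roughly 80% higher
# odds of responding than the reference group, other things being equal.
```

Odds ratios below 1 (such as Gender: Unclassified) shrink the odds; a ratio of 0.18 means the odds of responding are only 18% of the reference group's.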
Predicted probability using our model

Case  ARPU  Data usage  Gender  Age on Handset  Prob(Response)
1      100  Low         M       6-12 Months     45.85%
2      300  Low         M       6-12 Months     56.89%
3      700  Low         M       6-12 Months     75.78%
4      400  Mid         F       1-6 Months      26.73%
5      300  Low         F       6-12 Months     60.20%
6      400  Low         NA      12-18 Months    0.0003%
7      700  Low         F       6-12 Months     78.39%
8      700  Low         M       1-6 Months      39.56%
9      500  Low         F       6-12 Months     70.12%
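Probabilities like these come from plugging a customer's profile into the fitted equation: sum the intercept and the coefficients of the categories that apply, then pass the total through the logistic function. The sketch below uses the coefficients from the R glm output earlier; note it is not the exact model behind this table, which also includes age-on-handset terms.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients from the fitted glm
# (Response ~ ARPU + FINAL_OS + DATA_USAGE_MB + GENDER + SMS_COUNT).
intercept = -2.0846569
b_arpu = 0.0022055
b_low_data = 0.5063541
b_gender_m = -0.1887909
b_sms = -0.0005477

# A male, low-data customer with ARPU 300 and 20 outgoing SMS
# (reference levels for the remaining dummies, so they contribute 0).
logit = intercept + b_arpu * 300 + b_low_data + b_gender_m + b_sms * 20
prob = sigmoid(logit)
```

Scoring the full 100-million base this way and ranking by probability is what lets the campaign target the top prospects first.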
THANK YOU