event attendee count prediction

Post on 21-Apr-2017

Category:

Data & Analytics

TRANSCRIPT

Event – Webinar Attendee Count Prediction v1.0 pilot

Neeraj Tiwary

Data Scientist

Problem Statement

Marketers need to know the prospective attendee count for in-person and online events, for any product or geographic location, during the event setup and budget planning stage. Predicting attendee counts at planning time will help improve the overall success rate of events.

Business Cases

1. Predict event attendance from basic event attributes at the time the event is created.
Impact: This helps the event owner/marketer pre-plan the event.

2. Predict event attendance from registration counts and basic event attributes.
Impact: This helps the marketer/event owner improve the number of registrations between event setup and the start date.

Data

The training dataset contains 4000 records and the test dataset contains 800 records.

Architecture

Architecture - Continued

For each use case, we created two separate models and then combined them into a wrapper model. The reasons for creating two separate models were:

• To simplify the problem space.
• The distribution of the response variable suggested that the data follows a gamma distribution.
• The gamma distribution does not handle zero-inflated problems well, whereas the Poisson and negative binomial distributions do.
• The requirement is to predict the number of attendees of an event, which is a count regression problem. Regression algorithms such as linear regression or neural network regression were not suitable because their predictions range from -infinity to +infinity, whereas a count variable ranges from 0 to infinity.
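The wrapper idea can be sketched as follows. This is a minimal illustration in Python (the original work was done in R and AzureML), and the transcript does not say exactly how the two models were combined. The function name `predict_attendees`, the coefficient lists, and the log link for the gamma model are assumptions made for the example. One common way to combine such models is to multiply the probability of a non-zero count by the expected count given that it is non-zero.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_attendees(features, logit_coefs, gamma_coefs):
    """Combine two sub-models into one expected attendee count.

    All three arguments are equal-length lists of floats; the first
    feature is assumed to be a constant 1.0 (the intercept).
    The coefficients are hypothetical, not taken from the source.
    """
    # Model 1 (logistic): probability that the event has any attendees.
    z = sum(x * b for x, b in zip(features, logit_coefs))
    p_nonzero = sigmoid(z)

    # Model 2 (gamma with an assumed log link): expected count given
    # at least one attendee. exp() keeps the prediction positive.
    eta = sum(x * b for x, b in zip(features, gamma_coefs))
    mean_if_nonzero = math.exp(eta)

    # Wrapper: E[count] = P(count > 0) * E[count | count > 0].
    return p_nonzero * mean_if_nonzero
```

Because the first factor lies in (0, 1) and the second is positive, the combined prediction can never fall below zero, which is the range constraint the count problem requires.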

Data Cleansing

• Trimmed all variables to remove white space.
• Converted all categorical variable values to lower case.
• Replaced all null values with “Not Assigned” for uniformity in the data.
• Transformed data to obtain proper values for some common categorical variables.
• Removed low-frequency categorical data, as it was impacting the model.
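The cleansing steps above could look roughly like this in Python. It is a sketch only: the helper names, the `min_count` threshold, and the choice to drop whole rows with rare category levels are assumptions, since the transcript does not say how low-frequency data was removed.

```python
from collections import Counter


def clean_value(value):
    """Trim and lower-case a value; map nulls/blanks to a placeholder."""
    if value is None or str(value).strip() == "":
        return "Not Assigned"
    return str(value).strip().lower()


def drop_rare_levels(rows, column, min_count):
    """Keep only rows whose level in `column` occurs at least `min_count` times."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= min_count]


def clean_rows(rows, categorical_columns, min_count=2):
    """Apply trimming, lower-casing, null replacement, and rare-level removal."""
    cleaned = [
        {k: clean_value(v) if k in categorical_columns else v
         for k, v in row.items()}
        for row in rows
    ]
    for column in categorical_columns:
        cleaned = drop_rare_levels(cleaned, column, min_count)
    return cleaned
```

For example, `clean_rows([{"Product": " Alpha "}, {"Product": "alpha"}, {"Product": None}], ["Product"])` keeps the two "alpha" rows and drops the single "Not Assigned" row.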

Missing value imputation

• Consulted the business to replace missing values with actual values wherever possible.
• For the remaining missing values, used “Multiple Imputation” methods, as most of the data were missing at random and belonged to categorical variables.

Feature Engineering

This is the main step of any model development activity. We need to enhance our features to improve predictive power.

• Created dummy variables for categorical variables such as “Product” and “TargetAudience” using mtabulate in R.
• Dropped unused levels for all categorical variables.
• Created the “Hour of Day” attribute, which gives the hour at which the event starts.
• Created the “Month of Day” attribute, which gives the month in which the event starts.
• Created the “Duration” attribute, which gives the duration of the event.
• Created the “DaysBetweenEventCreationAndStartDate” attribute, which gives the period between the event creation date and its start date.
• Initially all the data were available as text strings, which we parsed to extract the relevant information. This pre-processing / text parsing was done before loading the data into R to develop the model.
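The derived attributes can be illustrated with a short Python sketch. The timestamp format, the output key names other than “DaysBetweenEventCreationAndStartDate”, the duration unit (hours), and the `one_hot` helper are assumptions for the example; `one_hot` is only a simple stand-in for dummy-variable creation, not a port of mtabulate.

```python
from datetime import datetime

# Hypothetical text format; the source does not show the raw strings.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def derive_time_features(created_text, start_text, end_text):
    """Parse three timestamp strings and derive the time-based attributes."""
    created = datetime.strptime(created_text, TIME_FORMAT)
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    return {
        "HourOfDay": start.hour,                      # hour the event starts
        "Month": start.month,                         # month the event starts
        "DurationHours": (end - start).total_seconds() / 3600.0,
        # Whole days between creation and start (partial days are dropped).
        "DaysBetweenEventCreationAndStartDate": (start - created).days,
    }


def one_hot(values, prefix):
    """Create one 0/1 dummy column per distinct level of a categorical variable."""
    levels = sorted(set(values))
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]
```

Building the dummy columns from the levels actually present in the data also has the effect of dropping unused levels.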

Descriptive Statistics – Attendee Count

Response Variable: Attendee count of a randomly chosen in-person event for a future date

Distribution (Log-Likelihood): boxplot, density plot, histogram (figures not included in the transcript)

Statistics:
Mean: 28.65435
Standard Deviation: 32.89823
Skewness: 2.267742
Kurtosis: 5.9273
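For readers who want to compute the same kind of summary on their own data, here is a minimal Python sketch. The transcript does not state which estimator was used or whether the kurtosis is raw or excess, so the moment-based formulas below (and the choice to report excess kurtosis) are assumptions and may not reproduce the figures above exactly.

```python
import math


def describe(values):
    """Return mean, sample standard deviation, skewness, and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": math.sqrt(m2 * n / (n - 1)),   # sample standard deviation
        "skewness": m3 / m2 ** 1.5,          # > 0 means a long right tail
        "excess_kurtosis": m4 / m2 ** 2 - 3, # 0 for a normal distribution
    }
```

A clearly positive skewness, as reported for the attendee count, indicates a long right tail, which is consistent with choosing a gamma model for the non-zero counts.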

Response Variable - Distribution

• The response variable “AttendeeCount” follows a gamma distribution.

• Many events (~23%) had ZERO attendees.

• Since the gamma model does not support a response of ZERO, we divided the problem into two sets:

1. Zero attendee count problem

2. Non-Zero attendee count problem
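The split into the two sub-problems can be sketched in Python as below. The function name and the `HasAttendees` label are hypothetical; only the “AttendeeCount” column name comes from the source.

```python
def split_for_two_models(rows, target="AttendeeCount"):
    """Prepare training data for the zero / non-zero sub-problems."""
    # Sub-problem 1: every event, labelled 1 if it had any attendees, else 0.
    binary_rows = [dict(row, HasAttendees=int(row[target] > 0)) for row in rows]
    # Sub-problem 2: only events with a positive count, since the gamma
    # distribution is defined for strictly positive values.
    positive_rows = [row for row in rows if row[target] > 0]
    return binary_rows, positive_rows
```

The first list would feed a binary classifier and the second a regression on strictly positive counts.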

Exploratory Analysis - ProgramOwner

Model Output: Business Case 1

Model 1: Logistic Regression
• ROC Curve
• AUC:
• Confusion Matrix

Model 2: Gamma Regression
• Accuracy:
• Model Parameters

Model Output: Business Case 2

Model 1: Logistic Regression
• ROC Curve
• AUC:
• Confusion Matrix

Model 2: Gamma Regression
• Accuracy:
• Model Parameters
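The AUC and confusion-matrix values for the logistic models are not preserved in this transcript. As a reminder of what those two outputs measure, the following Python sketch computes both from true labels and predicted probabilities. The function names and the 0.5 threshold are assumptions; the AUC here uses the pairwise-ranking definition rather than any particular library routine.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, prob in zip(y_true, y_prob):
        predicted = int(prob >= threshold)
        if predicted == 1 and actual == 1:
            counts["tp"] += 1
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Requires at least one positive and one negative.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of zero-attendance and non-zero-attendance events.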

Model - Actual vs Predicted + Registration

Model - AzureML

We developed the same model in AzureML and deployed it as a web service.

Below is a snippet of it in Excel.

Methodology

Used -> Gamma regression, logistic regression

Tried -> Poisson, negative binomial, neural network regression, etc.

Results

• Models developed with gamma / logistic regression gave better results than the other approaches tried.
• A marketer can change any attribute and check the predicted attendee count through the AzureML model, and based on that score will be better placed to make a decision.

Conclusions and Next Steps

After a thorough understanding of the problem, my recommendations for next steps are:

• Explore Vowpal Wabbit in AzureML.
• Embed the model in Power BI reporting.
