Predictive Crime Analytics
2011-2016
Madlen Ivanova Mansi Dubey
Praneesh Jayaraj
University of North Carolina at
Charlotte
Machine Learning Project
Report
Contents
1. Project Objective
2. Data sources
3. Data retrieval
4. Data cleansing and Preparation
5. Enriched the Incident dataset with multiple third-party data sets
6. Joining Data
7. Data Evaluation
8. Data Exploration
9. Feature engineering
10. Modeling
11. References
1. Project Objective
The project aims to analyze the crime data provided by CMPD and design predictive models implementing classification and neural networks to predict:
In how many days a case can be closed
Number of crimes to occur
Case status of an Incident
Our goal is to build predictive models based on different inputs, evaluate them and choose the best one.
2. Data sources
Our main dataset was provided by CMPD. Here is a list of the additional input data and their sources:
Day of week and time of day – feature engineering
Weather conditions - www.wunderground.com
Special Events – data.gov
Unemployment – United States Department of Labor
Twitter data - IBM Watson Analytics for Social Media
3. Data retrieval
It took many emails and a couple of meetings with CMPD (Charlotte-Mecklenburg Police Department) to work out the process we had to go through in order to be allowed access to the CMPD Incident data. The data was provided to us through the CMPD web site, and we were assigned a username and password so we could retrieve it securely. The data was available as plain text but not as a downloadable file, and it took us a week to find a way to extract it. Since none of the software we tried worked properly, we wrote a C# program to connect to the CMPD web site, establish a secure connection, and download the data. The data was then imported into an MS SQL Server for further analysis.
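The actual download was done with a custom C# program; as a rough illustration of the same idea, the Python sketch below authenticates against a password-protected site and reads the page text. The URL and credentials are placeholders, and HTTP basic auth is an assumption (the report does not state the authentication scheme).

```python
# Hypothetical sketch of the retrieval step. The URL, username, and
# password are placeholders, not real values from the project.
import urllib.request

def build_authenticated_opener(url, username, password):
    """Return an opener that sends HTTP basic-auth credentials for `url`."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)

def download_incident_data(url, username, password):
    """Fetch the protected page and return its text for import into SQL."""
    opener = build_authenticated_opener(url, username, password)
    with opener.open(url) as response:
        return response.read().decode("utf-8")
```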
The data arrived in plain text format, spread across the years 2011 to 2016, with 7 tables per year. Overall, we retrieved 42 tables (6 years x 7 tables). We used SQL Server Management Studio to merge all the years into just 7 tables. We had to create and design proper database tables so the data would fit the appropriate data types. The Complaint_No column contained the date/time value in plain text, so we had to read the column and extract the date and time into separate columns. We received 7 tables altogether and managed to link them.
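The extraction of date and time from Complaint_No can be sketched as below. The report only says the column encodes the date, the time, and an incremental number, so the exact layout used here ("YYYYMMDD-HHMM-NNNN") is a hypothetical stand-in for illustration.

```python
# Hypothetical parser for the Complaint_No column. The assumed layout
# "YYYYMMDD-HHMM-NNNN" is NOT from the report; it only illustrates
# splitting an encoded key into date, time, and sequence number.
from datetime import datetime

def split_complaint_no(complaint_no):
    """Split an encoded complaint number into (date, time, sequence)."""
    date_part, time_part, seq = complaint_no.split("-")
    stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return stamp.date(), stamp.time(), int(seq)
```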
Here is our entity relationship diagram:
4. Data cleansing and Preparation
We found a lot of discrepancies in the data. It looks like the source database system allowed any text to be entered into any column (no client-side data validation and no back-end data type enforcement). For example, it was common to see a city name in the ZipCode column, and city names were frequently misspelled. We chose which fields we were going to use in our analysis and did basic data cleansing. A small number of records (less than 1%) contained no information we could work with (no city and no zip code, so we could not match them to the proper city), and we removed them. We also deleted all columns with more than 50% missing values. We used multiple imputation to analyze the completeness of our dataset.
The data format was specified.
The types of the numeric fields (ordinal and continuous) were set properly.
Outliers were replaced with the mean value of the field; the outlier cut-off was set to 3 standard deviations.
All the missing data entries were replaced with:
Continuous fields: mean
Nominal fields: mode
Ordinal fields: median
Dates and times cannot be used directly by most algorithms, but durations can be computed and used as model features, so we estimated the duration period.
Features with too many missing values (> 50%) were excluded.
Rows with too many missing values (> 50%) were excluded.
Fields with too many unique categories (> 100) were excluded.
Categorical fields with too many values in a single category (> 90%) were excluded.
Sparse categories were merged to maximize association with the target; input fields left with only one category after supervised merging were excluded.
The dataset was partitioned into training (70%), testing (15%), and validation (15%) sets.
We created several views that allowed us to only view the data that we were interested in.
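Two of the cleaning rules above (3-standard-deviation outlier replacement and dropping features with more than 50% missing values) can be sketched in pandas as follows. The actual cleansing was done in SPSS and SQL; the column names here are illustrative.

```python
# Sketch of the 3-SD outlier rule and the >50%-missing column rule.
# Illustrative only; the project applied these steps in SPSS.
import pandas as pd

def clean_numeric(series):
    """Replace values beyond 3 standard deviations with the field mean,
    then fill any remaining missing values with the mean."""
    mean, std = series.mean(), series.std()
    capped = series.mask((series - mean).abs() > 3 * std, mean)
    return capped.fillna(capped.mean())

def drop_sparse_columns(df, threshold=0.5):
    """Drop columns where more than `threshold` of the values are missing."""
    return df.loc[:, df.isna().mean() <= threshold]
```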
5. Enriched the Incident dataset with multiple third-party data sets
We merged and exported the most interesting features into a single Excel file for modeling. However, the data we had could not be used for predictive modeling on its own, as there were not many attributes suitable for prediction. We therefore augmented it with multiple datasets to make it more insightful: weather, special events, Twitter, and unemployment data. The unemployment data contains the unemployment rate and labor force details for each month from 2011 to 2016 in the Charlotte area; we considered the 1-month net change when downloading it. The weather data contained the following features:
Max TemperatureF, Mean TemperatureF, Min TemperatureF
Max Dew PointF, MeanDew PointF, Min DewpointF
Max Humidity, Mean Humidity, Min Humidity
Max Sea Level PressureIn, Mean Sea Level PressureIn, Min Sea Level PressureIn
Max VisibilityMiles, Mean VisibilityMiles, Min VisibilityMiles
Max Wind SpeedMPH, Mean Wind SpeedMPH, Max Gust SpeedMPH
PrecipitationIn, CloudCover, Events, WindDirDegrees
Since we worked mostly at a “day” level (not on “hour” level) and we did
not have the correct time of the Incident, we considered only the mean values.
The “Special Event” dataset contained start and end date of the event, along with location and
description of the event.
The “Twitter” dataset contained the date, the number of tweets (per day), and positive/negative sentiment analysis; it was collected manually using IBM Watson Analytics for Social Media.
6. Joining Data
We used many different tables for our model. The Charlotte Mecklenburg Police Department has
provided multiple tables that include Incident related information. The Incident table is the main
table that connects to the additional tables – offenses, Property, Stolen_Vehicle, Victim_Business,
Victim_Person, and Weapons. All tables are linked by the Complaint_No column, which encodes the date, the time, and an incremental number that makes each record unique. Linking to the remaining CMPD tables is easy, as the same format is used in the
other tables. To link the CMPD data to any other data, we broke down the Complaint_No column
into a Date column, DateTime column, and separately we created Year, Month, Day columns.
The Unemployment, Weather, Twitter, and Special Events datasets were linked to the CMPD data based on date. The CMPD data includes details from outside the Charlotte area; however, our research is limited to the city of Charlotte. To filter the data, we created a ZipCodes table containing all of the zip codes for the Charlotte area. This way, we can link all the data to the ZipCodes table and filter it to the Charlotte zip codes. Additionally, some of the data came at the day level and some at the month level, so we grouped the day-level data at the month level to link it properly by (year, month). We performed analysis at both levels.
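The date-based join described above can be sketched in pandas: derive (Year, Month) keys from the incident date, then merge the month-level data on those keys. The actual joins were done in SQL Server, and the table contents below are illustrative stand-ins, not the real CMPD schema.

```python
# Sketch of linking month-level data (e.g. unemployment) to incidents
# via derived (Year, Month) keys. Values are illustrative.
import pandas as pd

incidents = pd.DataFrame({
    "Complaint_No": ["A1", "A2", "A3"],
    "Date": pd.to_datetime(["2013-01-05", "2013-01-20", "2013-02-03"]),
})
unemployment = pd.DataFrame({
    "Year": [2013, 2013], "Month": [1, 2], "UnemploymentRate": [9.1, 8.9],
})

incidents["Year"] = incidents["Date"].dt.year
incidents["Month"] = incidents["Date"].dt.month

# Month-level data joins on the derived (Year, Month) key.
joined = incidents.merge(unemployment, on=["Year", "Month"], how="left")
```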
7. Data Evaluation
We used Tableau, IBM Watson Analytics, and SPSS to explore the data, ensure its validity, and perform descriptive analysis.
SPSS: We used the “Descriptives”, “Descriptive Statistics” and the “Frequencies” command to
determine percentiles, quartiles, measures of dispersion, measures of central tendency (mean,
median, and mode), measures of skewness, and to create histograms.
We used Tableau to better visualize the data. Since we worked with a little more than half a million records, we had to live-stream the data from the MS SQL virtual cloud server; otherwise, the tool was constantly crashing.
IBM Watson Analytics was also used to create visualizations and find dependencies between variables. This tool did not support live-streaming from another server, so we had to upload our dataset to IBM's cloud space. We performed a lot of frequency analysis to better understand the distribution of our data.
Here are some of the interesting data visualizations:
Type of incident distribution:
Day of week over vehicle theft:
Day of week over homicide:
Analysis of the number of distinct Complaint_No values for each table in the CMPD database:
Analysis of the vehicle body types that have been stolen most frequently per zip code:
The trend of Number of Tweets over Week Day by Location Type:
Number of Incidents compared by year and day of the week:
Trend of the number of Incidents over the Incident hour and the case status:
Number of incidents over mean temperature by location type:
The time needed for a case to be resolved over incident hour and location type:
8. Data Exploration
We ran correlation analysis on the numeric features of the weather data. For each highly correlated pair of features, we removed one of the two. Then we ran the analysis again to confirm that we no longer had 1:1 correlations in our dataset. For instance, based on the correlations between the features below, we removed the “Max Gust SpeedMPH” and “Mean Humidity” features from our Excel file.
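The pairwise correlation filter above can be sketched as follows. The correlation threshold of 0.9 is an assumption for illustration; the report does not state the cut-off it used.

```python
# Sketch of the pairwise-correlation filter: compute the absolute
# correlation matrix and drop the second feature of each highly
# correlated pair. The 0.9 threshold is an assumed value.
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one feature from each pair whose |correlation| > threshold."""
    corr = df.corr().abs()
    cols = list(df.columns)
    to_drop = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in to_drop and b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.add(b)  # keep the first feature, drop the second
    return df.drop(columns=sorted(to_drop))
```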
Then we ran linear regression in SPSS and used the VIF (Variance Inflation Factor) values to detect multicollinearity between variables. For example, in the left picture below you can see that “Mean TemperatureF” and “MeanDew PointF” highly influence each other. After removing “MeanDew PointF”, the VIF score of Mean Temperature became normal, falling below our threshold of 3.
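The VIF check can be written out directly: for each feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The project computed this in SPSS; the sketch below uses plain numpy least squares.

```python
# Plain-numpy VIF: regress each column on the others and apply
# VIF = 1 / (1 - R^2). Values above ~3 flag multicollinearity.
import numpy as np

def vif(X):
    """Return the variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    scores = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        scores.append(1.0 / (1.0 - r2) if r2 < 1 else float("inf"))
    return scores
```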
9. Feature engineering
Feature engineering is fundamental to the application of machine learning. To improve our initial
results, we used Microsoft SQL Server Management Studio (SSMS) to create the following features:
We created Day of week and time of day features
The time interval for a case to be closed was calculated from the reported date and the clearance date
10. Modeling
Method #1: Classification
After the extensive data analysis, we started data modeling. For classification, we chose decision tree modeling. A decision tree classifier is a supervised learning algorithm that creates a model to predict class labels by learning decision rules from the data features.
We created the decision tree model using the IBM SPSS Modeler tool. To measure the quality of a split, we applied the Gini function for the information gain. Since the predictors are categorical, the model uses multi-way splits, and we set the minimum change in Gini to 0.0001. To improve the model accuracy, we used boosting with 10 component models. We chose to favor accuracy over stability, to create a model that can accurately predict the target variable. The
model aims at minimizing the misclassification cost. The stopping rule for building the tree is based on the minimum percentage of records in the parent (5%) and child (3%) branches.
There are many continuous features in the data, such as the incident hour, mean temperature, humidity, etc., which increase the learning time of the model and decrease its accuracy and performance. Hence, we converted these features into interval-scaled variables. We analyzed the effectiveness of these features on the classification task and kept the intervals that showed significant patterns in predicting the target variable.
Mean Temperature converted into following categories:
0-30
31-40
41-50
51-60
61-70
71-80
81-90
We extracted the incident hour from the incident date feature and converted it into a categorical feature:
00:00 - 03:59 - Midnight
04:00 - 07:59 - EarlyMorning
08:00 - 11:59 - Morning
12:00 - 15:59 - Afternoon
16:00 - 19:59 - Evening
20:00 - 23:59 - Night
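The two interval-scaling steps above can be sketched as small lookup functions. The binning was actually done during data preparation for SPSS; these functions only reproduce the interval boundaries listed above.

```python
# The report's interval scaling: 4-hour time frames for the incident
# hour and 10-degree bands for the mean temperature.
def time_frame(hour):
    """Map an hour (0-23) to the report's six 4-hour time frames."""
    labels = ["Midnight", "EarlyMorning", "Morning",
              "Afternoon", "Evening", "Night"]
    return labels[hour // 4]

def temp_range(temp_f):
    """Map a mean temperature (F) to the report's interval labels."""
    if temp_f <= 30:
        return "0-30"
    upper = min(90, ((int(temp_f) - 1) // 10 + 1) * 10)
    return f"{upper - 9}-{upper}"
```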
We dropped the highly correlated attributes and the ones without a significant predictor performance rating. After multiple iterations with different combinations of input features, we implemented classification using the decision tree algorithm to classify/predict the Case_Status of an incident. We partitioned the data into training (70%) and testing (30%) datasets. The model was trained on the training data, and its performance was analyzed on the test data for each trial using a coincidence matrix and error rate calculation. If the percentage of correct predictions was less than 50%, we discarded that model and refined it by dropping the features with a low rating on the predictor importance chart.
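The tree itself was built in IBM SPSS Modeler; the sketch below shows the same idea, a Gini-based decision tree over categorical predictors, using scikit-learn on toy data. scikit-learn needs one-hot encoded inputs and binary splits, unlike Modeler's multi-way splits, and the column values here are made up for illustration.

```python
# Gini-based decision tree on one-hot encoded categorical predictors.
# Toy stand-ins for the WeekDay/TimeFrame predictors and the
# Case_Status target; not the actual CMPD data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "WeekDay": ["Mon", "Sat", "Sat", "Mon", "Sun", "Tue"],
    "TimeFrame": ["Night", "Night", "Morning", "Morning", "Night", "Evening"],
    "Case_Status": ["Open", "Closed", "Closed", "Open", "Closed", "Open"],
})
X = pd.get_dummies(data[["WeekDay", "TimeFrame"]])  # one-hot encode
y = data["Case_Status"]

tree = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=1e-4)
tree.fit(X, y)
train_accuracy = tree.score(X, y)
```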
We considered the features that would be available at the time an incident is reported, such as the weekday, the time, the place where the crime occurred, the agency it was reported to, the location type (indoor/outdoor), the temperature, etc. Following are the details of our classification model:
Target: Case_Status - Case_Status is Status of report (Closed/Cleared, Closed/Leads Exhausted,
Further Investigation, Inactive) at time of last update (September, 2016).
Predictors: WeekDay, Month, TimeFrame, Place1, Reporting_Agency, Location_Type, Temp_Range,
Events
Partition: Training Dataset (70%) and Test dataset (30%)
Rules:
The predictor variables for Case_Status are classified by importance as follows:
We tried different settings, increasing the allowed depth of the decision tree and the number of component models used for boosting to 20. With these updated settings, the accuracy of the model jumped to almost 70%. In the future, we plan to continue pruning the model to improve its performance.
Evaluation: Both the predictors and the training data size affect the performance of the model. We iterated over the model learning process with different feature sets and different partitions. Our final result achieves 52.6% accuracy, which is not a very good model, but we can utilize different feature selection techniques and bagging and boosting methodologies to improve the accuracy.
Future Work: We will focus on improving the performance of the decision tree model using different feature selections, try different tools and classification algorithms, and compare the results to our decision tree model.
Method #2: Neural Networks
Neural networks are the preferred tool for many predictive data mining applications because of
their power and flexibility. To implement Neural Networks, we created dummy variables to
transform categorical variables into numeric. After we pre-processed the data being fed into the
neural network by removal of redundant information, multicollinearity and outlier treatment, and
all the other processes mentioned in section 4, we ran the model. We used one of the most standard algorithms for this type of supervised learning, the multilayer perceptron (MLP). An MLP is a function of the predictors (or independent variables) that minimizes the prediction error of the target variable(s). Connection weights were updated after each record was processed, based on the amount of error in the output compared to the expected result. This is how learning occurs in the perceptron, carried out through "backward propagation of errors" that attempts to minimize the loss function. To link the weighted sums of the units in one layer to the values of the units in the succeeding layer, we used the sigmoid activation function σ(x) = 1/(1+e^(−x)), which takes a real-valued input (the signal strength after the sum) and squashes it to the range between 0 and 1.
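The sigmoid squashing and the weighted sum of one layer can be written out directly in numpy; this is only the forward-pass formula from the paragraph above, not the full SPSS training procedure.

```python
# The sigmoid activation σ(x) = 1/(1+e^(−x)) and one MLP layer's
# forward pass: sigmoid of the weighted sum of the inputs plus bias.
import numpy as np

def sigmoid(x):
    """Squash a real-valued input to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(inputs, weights, bias):
    """One MLP layer: sigmoid of the weighted sum of the inputs."""
    return sigmoid(inputs @ weights + bias)
```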
We partitioned the dataset into training sample, test sample, and validation (holdout) sample with
ratio 70:15:15. The training sample comprises the data records used to train the neural network.
The testing sample is an independent set of data records used to track errors during training to
prevent overtraining. The holdout sample is a second independent set of data records used for
assessing the final neural network. The error for the holdout sample is the one that gives an
"honest" estimate of the predictive ability of the model because the holdout cases were not used to
build the model.
Models:
Model A – “day” level.
The data used was at the day level. As the target variable, we used the time interval for a case to be resolved (called Clearance_Timeframe). Initially, we set the random sample size to 20,000 (about 4% of our data). The stopping rule for building the neural network was the point where the error could not be decreased further. We tried different combinations of inputs to compare them and find the best fit. Additionally, we used different sample sizes and partitioning ratios. We tried to improve the performance through feature engineering, since the MLP is sensitive to its parameters, but it was not very helpful. Increasing the sample size and dropping variables were among the most useful ways to improve the model.
Results: The results were still not very good. The following list shows the features by importance for the model with the highest assessed accuracy:
Model B – “month” level
The unemployment data was given at a “month” level. To match this level, we had to aggregate our data up to the month level as well. Thus, we could not use the categorical variables, because aggregating them was not appropriate. All other values were summed, averaged, or counted. After preparing the data, we fed it into our model. As target variables, we used the time interval for a case to be resolved and the number of incidents.
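The month-level roll-up described above can be sketched with a pandas group-by: numeric fields are averaged, and incidents are counted. The column names and values are illustrative stand-ins for the real data.

```python
# Sketch of the month-level aggregation for Model B: group incidents
# by (Year, Month), count them, and average the numeric fields.
import pandas as pd

incidents = pd.DataFrame({
    "Date": pd.to_datetime(["2013-01-05", "2013-01-20", "2013-02-03"]),
    "Clearance_Days": [10, 20, 6],
})
incidents["Year"] = incidents["Date"].dt.year
incidents["Month"] = incidents["Date"].dt.month

monthly = incidents.groupby(["Year", "Month"]).agg(
    NumberOfIncidents=("Clearance_Days", "size"),
    MeanClearanceDays=("Clearance_Days", "mean"),
).reset_index()
```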
Results: The accuracy is better than the previous model's, and the UnemploymentRate feature seems to have a significant impact. The predictor variables for NumberOfIncidents are classified by importance as follows:
The results at this aggregation level were interesting. They show that the average monthly time for a case outcome is best explained by the number of incidents, the year, the average incident hour, and the number of tweets. The predictor variables for ClearanceTimeframe are classified by importance as follows:
Further Evaluation:
The accuracy of the models depends on many factors, such as the sample size, the ratio between the sample size and the number of features used, the relationships between features, the initial weights and biases, the target variable, and the division of the data into training, validation, and test sets. These different conditions can lead to very different solutions for the same problem. For example, if we change the ratio of the training, validation, and test sets to 50:25:25, the accuracy becomes 93%. The result is similar when we include all the features we had excluded because of multicollinearity. Additionally, the model accuracy can be enhanced by boosting, and the model stability by bagging. In Model A, the relatively low performance can be explained by the number of missing values in the sample dataset, noise, and the selection of features. When we ran the boosting option, an ensemble was created that generates a sequence of models to obtain a more accurate prediction. The accuracy of Model A jumped from 63.3% to 77.5%, while the accuracy of Model B reached 97.8% for the NumberOfIncidents target variable and 99.9% for the ClearanceTimeFrame target variable. Then we ran the bagging option, which created an ensemble using bagging (bootstrap aggregating) that generates multiple models to obtain more reliable predictions. The accuracy of the best generated prediction for Model A became 70%, and the accuracy of Model B reached 90.8% for the NumberOfIncidents target variable and 97.4% for the ClearanceTimeFrame target variable.
Following is a table comparing the accuracy of the algorithms used:

Model   | Target             | MLP   | + Bagging | + Boosting
Model A | ClearanceTimeFrame | 63.3% | 70%       | 77.5%
Model B | ClearanceTimeFrame | 77.6% | 97.4%     | 99.9%
Model B | NumberOfIncidents  | 76%   | 90.8%     | 97.8%
Future work:
We can consider adding new data to our existing models, such as the most recent criminal activity (within the past 24-48 hours), recent call-for-service activity, school data, and house and rent prices. We could also perform analysis per zip code. Deep learning is another option: since deep networks can be trained in both unsupervised and supervised manners, we could pre-train a deep network in an unsupervised manner before training it in a supervised one.
11. References
Flach, P. (n.d.). Machine Learning.
Why Does Unsupervised Pre-training Help Deep Learning? http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_ErhanCBV10.pdf
Neural Network Studies. 1. Comparison of Overfitting and Overtraining. http://pubs.acs.org/doi/abs/10.1021/ci00027a006
Boosting. http://web.stanford.edu/~hastie/TALKS/boost.pdf
Dropout: A Simple Way to Prevent Neural Networks from Overfitting. http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf
IBM SPSS Neural Networks 22. http://www.sussex.ac.uk/its/pdfs/SPSS_Neural_Network_22.pdf
Deep Learning Basics. (n.d.).
Song, Y.-Y., & Lu, Y. (n.d.). Decision Tree Methods: Applications for Classification and Prediction.