

Predictive Crime Analytics 2011-2016

Madlen Ivanova, Mansi Dubey, Praneesh Jayaraj

University of North Carolina at Charlotte

Machine Learning Project Report


Contents

1. Project Objective
2. Data sources
3. Data retrieval
4. Data cleansing and Preparation
5. Enriched the Incident dataset with multiple third-party data sets
6. Joining Data
7. Data Evaluation
8. Data Exploration
9. Feature engineering
10. Modeling
11. References


1. Project Objective

The project aims to analyze the crime data provided by CMPD and design predictive models implementing classification and neural network algorithms to predict:

In how many days a case can be closed

The number of crimes that will occur

The case status of an incident

Our goal is to build predictive models based on different inputs, evaluate them, and choose the best one.

2. Data sources

Our main dataset was provided by CMPD. Here is a list of the additional input data and their sources:

Day of week and time of day – feature engineering

Weather conditions - www.wunderground.com

Special Events – data.gov

Unemployment – United States Department of Labor

Twitter data - IBM Watson Analytics for Social Media

3. Data retrieval

It took many emails and a couple of meetings with CMPD (Charlotte-Mecklenburg Police Department) to discuss the process we had to go through in order to be allowed access to the CMPD Incident data. The data was provided to us through the CMPD web site. A username and password were assigned to us so that we could retrieve the data securely. The data was provided as plain text, but it was not available for download. It took us a week to find a way to extract it. Since none of the software we tried worked properly, we wrote a C# program to connect to the CMPD web site, establish a secure connection, and download the data. The data was imported into an MS SQL Server for further analysis.

The data arrived in plain text format. It spanned the years 2011 to 2016, with 7 tables per year. Overall, we retrieved 42 tables (6 years × 7 tables). We used SQL Server Management Studio to merge all the years into just 7 tables. We had to create and design proper database tables so the data would fit the appropriate data types. The Complaint_No column contained the date/time value in plain text, so we had to read the column and extract the date and time into separate columns. We ended up with 7 tables altogether and managed to link them.

Here is our entity relationship diagram:

4. Data cleansing and Preparation

We found a lot of discrepancies in the data. It looks like the source database system allowed any text to be entered into any column (no client-side data validation and no back-end data type enforcement). For example, it was common to see a city name in the ZipCode column, and city names were frequently misspelled. We chose which fields we were going to use in our analysis and did basic data cleansing. A small number of records (less than 1%) did not contain information we could work with (no city and no zip code, so we could not match the records to the proper city), and we removed them. We also deleted all columns where more than 50% of the values were missing. We used multiple imputation to analyze the completeness of our dataset.


The data format was specified.

The type of each numeric field (ordinal or continuous) was adjusted appropriately.

Outliers were replaced with the mean value of the field; the outlier cut-off was set to 3 standard deviations.

Missing entries were replaced as follows: continuous fields with the mean, nominal fields with the mode, and ordinal fields with the median.

Dates and times cannot be used directly by most algorithms, but durations can be computed and used as model features, so we computed the duration period.

Features with too many missing values (> 50%) were excluded.

Rows with too many missing values (> 50%) were excluded.

Fields with too many unique categories (> 100 categories) were excluded.

Categorical fields with too many values in a single category (> 90%) were excluded.

Sparse categories were merged to maximize association with the target. Input fields that had only one category after supervised merging were excluded.

The dataset was partitioned into training (70%), testing (15%), and validation (15%) sets.

We created several views that allowed us to only view the data that we were interested in.
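To make these preparation rules concrete, here is a minimal pandas sketch of the kind of cleansing described above. The file name and dataframe are placeholders; the actual processing was done in SPSS and SQL Server, so this is only an illustration.

```python
import pandas as pd
import numpy as np

df = pd.read_csv("incidents_enriched.csv")  # placeholder file name

# Drop columns and rows with more than 50% missing values
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
df = df.dropna(axis=0, thresh=int(0.5 * df.shape[1]))

# Replace outliers (beyond 3 standard deviations) with the column mean
for col in df.select_dtypes(include=np.number):
    mean, std = df[col].mean(), df[col].std()
    outliers = (df[col] - mean).abs() > 3 * std
    df.loc[outliers, col] = mean

# Impute remaining missing values: mean for continuous fields, mode for nominal fields
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

# Partition into training (70%), testing (15%), and validation (15%)
train = df.sample(frac=0.70, random_state=42)
rest = df.drop(train.index)
test = rest.sample(frac=0.50, random_state=42)
validation = rest.drop(test.index)
```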

5. Enriched the Incident dataset with multiple third-party data sets

We merged and exported the most interesting features into a single Excel file for modeling. However, the data we had could not be used for predictive modeling on its own, as there were not many attributes that could be used for prediction. We therefore augmented it with multiple datasets to make our data more insightful. We used weather, special events, Twitter, and unemployment data. The unemployment data contains the unemployment rate and labor force details for each month from 2011 to 2016 in the Charlotte area. We considered the 1-month net change when downloading the data. The weather data contained the following features:

Max TemperatureF, Mean TemperatureF, Min TemperatureF

Max Dew PointF, MeanDew PointF, Min DewpointF

Max Humidity, Mean Humidity, Min Humidity

Max Sea Level PressureIn, Mean Sea Level PressureIn, Min Sea Level PressureIn

Max VisibilityMiles, Mean VisibilityMiles, Min VisibilityMiles

Max Wind SpeedMPH, Mean Wind SpeedMPH, Max Gust SpeedMPH

PrecipitationIn, CloudCover, Events, WindDirDegrees


Since we worked mostly at a "day" level (not at an "hour" level) and we did not have the correct time of the incident, we considered only the mean values.

The "Special Event" dataset contained the start and end date of each event, along with the location and description of the event.

The "Twitter" dataset contained the date, the number of tweets (per day), and positive/negative sentiment analysis; it was collected manually using IBM Watson Analytics for Social Media.

6. Joining Data

We used many different tables for our model. The Charlotte-Mecklenburg Police Department provided multiple tables that include incident-related information. The Incident table is the main table, and it connects to the additional tables: Offenses, Property, Stolen_Vehicle, Victim_Business, Victim_Person, and Weapons. All tables are linked by the column Complaint_No. The Complaint_No column encodes the date, the time, and an incremental number, which makes each record unique. Linking to the remaining tables from CMPD is easy, since the same format is used in the other tables. To link the CMPD data to any other data, we broke down the Complaint_No column into a Date column and a DateTime column, and separately we created Year, Month, and Day columns.

Additional data from the Unemployment, Weather, Twitter, and Special Events datasets was linked to the CMPD data based on date. The CMPD data includes details from outside the Charlotte area; however, our research is limited to the city of Charlotte. To filter the data, we created a ZipCodes table containing all of the zip codes for the Charlotte area. This way, we can link all the data to the ZipCodes table and filter it on the Charlotte zip codes. Additionally, some of the data came at the day level, and some data came at the month level. We had to aggregate data to the month level so that we could properly link it by date (year, month). We performed analysis at both levels.
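As a rough illustration of this joining logic, here is a pandas sketch. The file names and columns (incidents, weather, unemployment, zipcodes, Date, ZipCode) are simplified stand-ins for our actual schema; the real joins were done in SQL Server.

```python
import pandas as pd

incidents = pd.read_csv("incidents.csv", parse_dates=["Date"])  # placeholder files
weather = pd.read_csv("weather.csv", parse_dates=["Date"])
unemployment = pd.read_csv("unemployment.csv")                  # monthly data
charlotte_zips = pd.read_csv("zipcodes.csv")["ZipCode"]

# Keep only incidents inside the Charlotte area
incidents = incidents[incidents["ZipCode"].isin(charlotte_zips)]

# Derive Year/Month/Day columns from the incident date
incidents["Year"] = incidents["Date"].dt.year
incidents["Month"] = incidents["Date"].dt.month
incidents["Day"] = incidents["Date"].dt.day

# Day-level join with the weather data
day_level = incidents.merge(weather, on="Date", how="left")

# Month-level join with the unemployment data (aggregate the incidents first)
monthly = incidents.groupby(["Year", "Month"]).size().reset_index(name="NumberOfIncidents")
month_level = monthly.merge(unemployment, on=["Year", "Month"], how="left")
```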

7. Data Evaluation

We used Tableau, IBM Watson Analytics, and SPSS to learn the data, ensure data validity, and perform descriptive analysis.

SPSS: We used the "Descriptives", "Descriptive Statistics", and "Frequencies" commands to determine percentiles, quartiles, measures of dispersion, measures of central tendency (mean, median, and mode), and measures of skewness, and to create histograms.
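Roughly the same summaries can be produced with pandas. This is only a sketch; it assumes the cleaned dataframe df from the cleansing sketch above, and Location_Type stands in for any nominal field.

```python
import matplotlib.pyplot as plt

numeric = df.select_dtypes(include="number")

print(numeric.describe(percentiles=[0.25, 0.5, 0.75]))  # quartiles, mean, dispersion
print(numeric.skew())                                    # skewness per numeric field
print(df["Location_Type"].value_counts())                # frequencies for a nominal field

numeric.hist(bins=30)                                     # histograms
plt.show()
```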

We used Tableau to better visualize the data. Since we worked with a little more than half a million records, we had to live-stream the data from the MS SQL virtual cloud server; otherwise, the tool was constantly crashing.


IBM Watson Analytics was also used to create visualizations and find dependencies between variables. This tool did not support live-streaming from another server, so we had to upload our dataset to IBM's cloud space. We performed a lot of frequency analysis to better understand the distribution of our data.

Here are some of the interesting data visualizations:

Type of incident distribution

Day of week vs. vehicle theft

Day of week vs. homicide

Number of distinct Complaint_No values for each table in the CMPD database

Vehicle body types stolen most frequently, per zip code

Trend of the number of tweets over week day, by location type

Number of incidents compared by year and day of the week

Trend of the number of incidents over the incident hour and the case status

Number of incidents over mean temperature, by location type

Time needed for a case to be resolved, over incident hour and location type

8. Data Exploration

We ran correlation analysis on the numeric features of the weather data. We looked at the pairs of values and removed one feature from each pair of highly correlated ones. Then we ran the analysis again to confirm that we no longer had 1:1 correlations in our dataset. For instance, based on the correlations between the features, we removed the "Max Gust SpeedMPH" and "Mean Humidity" features from our Excel file.


Then we ran linear regression in SPSS and used the VIF (Variance Inflation Factor) values to detect multicollinearity between variables. For example, the "Mean TemperatureF" and "MeanDew PointF" features highly influence each other. After removing "MeanDew PointF", the VIF score of Mean TemperatureF became normal, falling below our threshold of 3.
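A minimal Python sketch of this exploration step, assuming the day-level dataframe day_level from the joining sketch above (the list of weather columns is illustrative):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

weather_cols = ["Mean TemperatureF", "MeanDew PointF", "Mean Humidity",
                "Mean Sea Level PressureIn", "Mean VisibilityMiles",
                "Mean Wind SpeedMPH", "Max Gust SpeedMPH", "PrecipitationIn"]
X = day_level[weather_cols].dropna()

# Pairwise correlations: drop one feature from each highly correlated pair
print(X.corr().round(2))

# Variance Inflation Factors: values above our threshold of 3 indicate multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))
```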

9. Feature engineering

Feature engineering is fundamental to the application of machine learning. To improve our initial results, we used Microsoft SQL Server Management Studio (SSMS) to create the following features:

Day of week and time of day

The time interval for a case to be closed, calculated from the reported date and the clearance date
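The equivalent derivations in pandas would look roughly like this. We actually did this in T-SQL; Reported_Date and Clearance_Date are assumed column names on the day_level dataframe from the earlier sketches.

```python
import pandas as pd

# Derive day-of-week, incident-hour, and clearance-duration features
day_level["Reported_Date"] = pd.to_datetime(day_level["Reported_Date"])
day_level["Clearance_Date"] = pd.to_datetime(day_level["Clearance_Date"])

day_level["WeekDay"] = day_level["Reported_Date"].dt.day_name()
day_level["Incident_Hour"] = day_level["Reported_Date"].dt.hour

# Days between the reported date and the clearance date (a modeling target)
day_level["Clearance_Timeframe"] = (
    day_level["Clearance_Date"] - day_level["Reported_Date"]
).dt.days
```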

10. Modeling

Method #1: Classification

After the extensive data analysis, we started data modeling. For classification, we chose decision tree modeling. A decision tree classifier is a supervised learning algorithm that creates a model to predict class labels by learning decision rules based on the data features.

We created the decision tree model using the IBM SPSS Modeler tool. To measure the quality of a split, we applied the Gini function for the information gain. Since the predictors are categorical, the model uses multi-way splits, and we set the minimum change in Gini to 0.0001. To improve the model accuracy, we used boosting with 10 component models. We chose to favor accuracy over stability, to create a model that can accurately predict the target variable. The model aims to minimize the misclassification cost. The stopping rule for building the tree is based on the minimum percentage of records in the parent (5%) and child (3%) branches.

There are many continuous features in the data, such as the incident hour, mean temperature, humidity, etc., which increase the learning time of the model and decrease model accuracy and performance. Hence, we converted these features into interval-scaled variables. We analyzed the effectiveness of these features on the classification task and kept the intervals that showed significant patterns in predicting the target variable.

Mean Temperature was converted into the following categories:

0-30

31-40

41-50

51-60

61-70

71-80

81-90

We extracted the incident hour from the incident date feature and converted it into a categorical feature (see the sketch after this list):

00:00 - 03:59 - Midnight

04:00 - 07:59 - EarlyMorning

08:00 - 11:59 - Morning

12:00 - 15:59 - Afternoon

16:00 - 19:59 - Evening

20:00 - 23:59 - Night
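A hedged pandas sketch of this binning, with bin edges mirroring the categories listed above (the actual recoding was done in SPSS Modeler):

```python
import pandas as pd

# Bin the mean temperature into the interval categories listed above
temp_bins = [0, 30, 40, 50, 60, 70, 80, 90]
temp_labels = ["0-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81-90"]
day_level["Temp_Range"] = pd.cut(day_level["Mean TemperatureF"],
                                 bins=temp_bins, labels=temp_labels)

# Bin the incident hour into the time-of-day categories listed above
hour_bins = [0, 4, 8, 12, 16, 20, 24]
hour_labels = ["Midnight", "EarlyMorning", "Morning", "Afternoon", "Evening", "Night"]
day_level["TimeFrame"] = pd.cut(day_level["Incident_Hour"], bins=hour_bins,
                                labels=hour_labels, right=False)
```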

We dropped the highly correlated attributes and the ones that did not have a significant predictor importance rating. After multiple iterations with different combinations of input features, we implemented classification using a decision tree algorithm to classify/predict the Case_Status of an incident. We partitioned the data into training (70%) and testing (30%) datasets. The model was trained using the training data, and its performance was then analyzed on the test data for each trial using a coincidence (confusion) matrix and error-rate calculation. If the percentage of correct predictions was less than 50%, we discarded that model and refined it by dropping the features which had a low rating on the predictor importance chart.

We considered the features that will be available at the time an incident is reported, such as weekday, time, the place where the crime occurred, the agency it was reported to, the location type (e.g., indoor/outdoor), temperature, etc. Following are the details of our classification model:

Target: Case_Status, the status of the report (Closed/Cleared, Closed/Leads Exhausted, Further Investigation, Inactive) at the time of the last update (September 2016).

Predictors: WeekDay, Month, TimeFrame, Place1, Reporting_Agency, Location_Type, Temp_Range, Events


Partition: Training Dataset (70%) and Test dataset (30%)
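A rough scikit-learn equivalent of this setup. We actually built the model in SPSS Modeler; the boosting step is approximated here with AdaBoost over Gini-based trees, and the predictor and target columns are assumed to exist on the day_level frame from the earlier sketches.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

predictors = ["WeekDay", "Month", "TimeFrame", "Place1", "Reporting_Agency",
              "Location_Type", "Temp_Range", "Events"]
X = pd.get_dummies(day_level[predictors])      # one-hot encode the categorical predictors
y = day_level["Case_Status"]

# 70% / 30% training-testing partition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Gini-based decision tree, boosted with 10 component models
base_tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=0.03)
model = AdaBoostClassifier(estimator=base_tree, n_estimators=10, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(confusion_matrix(y_test, pred))
print("Accuracy:", accuracy_score(y_test, pred))
```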

Rules:

The predictor variables for Case_Status were ranked by importance as follows:


We tried different settings, increasing the allowed depth of the decision tree and the number of component models used for boosting to 20. With these updated settings, the accuracy of the model jumped to almost 70%. In the future we plan to continue pruning the model and improving its performance.

Evaluation: The predictors as well as the training data size affect the performance of the model. We iterated over the model learning process with different feature sets and different partitions. The final result achieves 52.6% accuracy, which is not a very good model, but we can utilize different feature selection techniques and bagging and boosting methodologies to improve the accuracy of the model.


Future Work: We will focus on improving the performance of the decision tree model using different feature selections, and we will try different tools and classification algorithms and compare the results to our decision tree model.

Method #2: Neural Networks

Neural networks are a preferred tool for many predictive data mining applications because of their power and flexibility. To implement neural networks, we created dummy variables to transform categorical variables into numeric ones. After we pre-processed the data being fed into the neural network (removal of redundant information, multicollinearity and outlier treatment, and all the other processes mentioned in section 4), we ran the model. We used one of the most standard algorithms for this type of supervised learning, the multilayer perceptron (MLP). An MLP is a function of predictors (or independent variables) that minimizes the prediction error of the target variable(s). Connection weights are changed after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is how learning occurs in the perceptron, carried out through "backward propagation of errors", attempting to minimize the loss function. To link the weighted sums of units in a layer to the values of units in the succeeding layer, we used the sigmoid activation function σ(x) = 1/(1 + e^(−x)), which takes a real-valued input (the signal strength after the sum) and squashes it to a range between 0 and 1.

We partitioned the dataset into a training sample, a test sample, and a validation (holdout) sample with a 70:15:15 ratio. The training sample comprises the data records used to train the neural network. The testing sample is an independent set of data records used to track errors during training and prevent overtraining. The holdout sample is a second independent set of data records used for assessing the final neural network. The error for the holdout sample gives an "honest" estimate of the predictive ability of the model, because the holdout cases were not used to build the model.
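A minimal scikit-learn sketch of this kind of MLP. We actually used the SPSS Neural Networks module; the predictor list is illustrative, and the Clearance_Timeframe target is assumed to have been recoded into categories on the day_level frame from the earlier sketches.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

predictor_cols = ["WeekDay", "TimeFrame", "Location_Type", "Temp_Range",
                  "Events", "Place1"]                    # illustrative input set
X = pd.get_dummies(day_level[predictor_cols])            # dummy-coded categorical inputs
y = day_level["Clearance_Timeframe"]                     # assumed recoded into categories

# 70% training, 15% testing, 15% validation (holdout)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)

# Multilayer perceptron with a sigmoid (logistic) activation function
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)

print("Test accuracy:", mlp.score(X_test, y_test))
print("Holdout accuracy:", mlp.score(X_val, y_val))
```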

Models:

Model A – "day" level.

The data used was at the "day" level. As the target variable, we used the time interval for a case to be resolved (called Clearance_Timeframe). Initially, we set the random sample size to 20,000 (about 4% of our data). The stopping rule for building the neural network was the point where the error could not be decreased further. We tried different combinations of inputs to compare them and find the best fit. Additionally, we used different sample sizes and partitioning ratios. We tried to improve the performance by feature engineering, since the MLP is sensitive to its parameters, but it was not very helpful. Increasing the sample and dropping variables were some of the most useful ways to improve the model.


Results: The results were still not very good. The following list represents features by importance for the model with the highest assessed accuracy:


Model B – "month" level.

The unemployment data was given at a "month" level. To match this level, we had to aggregate our data and bring it up to the month level as well. Thus, we could not use the categorical variables, because aggregating them was not appropriate. All other values were either summed, averaged, or counted. After preparing the data, we fed it into our model. As target variables, we used the time interval for a case to be resolved and the number of incidents.

Results: The accuracy is better than the previous model, and the UnemploymentRate feature seems to have a significant impact. The predictor variables for NumberOfIncidents were ranked by importance as follows:


The results at this aggregation level were interesting. They show that the average monthly time for a case outcome is explained best by the number of incidents, the year, the average incident hour, and the number of tweets. The predictor variables for ClearanceTimeFrame were ranked by importance as follows:


Further Evaluation:

The accuracy of the models depends on many factors, such as the sample size, the ratio between the sample size and the number of features used, the relationships between features, the initial weights and biases, the target variable, and the division of data into training, validation, and test sets. These different conditions can lead to very different solutions for the same problem. For example, if we change the ratio of the training, validation, and test sets to 50:25:25, the accuracy becomes 93%. The result is similar when we include all the features that we had left out because of multicollinearity. Additionally, the model accuracy can be enhanced by boosting, and the model stability by bagging. In Model "A", the relatively low performance of the model can be explained by the number of missing values in the sample dataset, noise, and the selection of features. When we ran the boosting option, an ensemble was created that generates a sequence of models to obtain more accurate predictions. The accuracy of model "A" jumped from 63.3% to 77.5%. The accuracy of model "B" reached 97.8% for the NumberOfIncidents target variable and 99.9% for the ClearanceTimeFrame target variable. When we ran the bagging option, an ensemble was created using bagging, or bootstrap aggregating, which generates multiple models to obtain more reliable predictions. With bagging, the accuracy of the best generated prediction for model "A" became 70%, and the accuracy of model "B" reached 90.8% for the NumberOfIncidents target variable and 97.4% for the ClearanceTimeFrame target variable.

Following is a table comparing the accuracy of the algorithms used:

Model     Target                  MLP      + Bagging   + Boosting
Model A   ClearanceTimeFrame      63.3%    70%         77.5%
Model B   ClearanceTimeFrame      77.6%    97.4%       99.9%
Model B   NumberOfIncidents       76%      90.8%       97.8%
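A hedged sketch of how such bagging and boosting ensembles could be built in scikit-learn (SPSS Modeler handles this internally; mlp, X_train, y_train, X_test, and y_test are from the MLP sketch above):

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: bootstrap aggregating over several copies of the base network
bagged = BaggingClassifier(estimator=mlp, n_estimators=10, random_state=42)
bagged.fit(X_train, y_train)
print("Bagged accuracy:", bagged.score(X_test, y_test))

# Boosting: a sequence of models, each focusing on the previous one's errors
# (AdaBoost needs sample weighting, so a tree base learner is used here instead of the MLP)
boosted = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=10, random_state=42)
boosted.fit(X_train, y_train)
print("Boosted accuracy:", boosted.score(X_test, y_test))
```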


Future work:

We can consider adding new data to our existing models, such as the most recent criminal activity (within the past 24-48 hours), recent calls-for-service activity, school data, and house and rent prices. We could also perform analysis per zip code. A deep learning implementation is another option. Since a deep network can be trained in an unsupervised or supervised manner for both unsupervised and supervised learning tasks, we could pre-train it in an unsupervised manner before training the network in a supervised one.

11. References

Flach, P. Machine Learning.

Erhan, D., et al. Why Does Unsupervised Pre-training Help Deep Learning? http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_ErhanCBV10.pdf

Neural network studies. 1. Comparison of overfitting and overtraining. http://pubs.acs.org/doi/abs/10.1021/ci00027a006

Hastie, T. Boosting. http://web.stanford.edu/~hastie/TALKS/boost.pdf

Srivastava, N., et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf

IBM SPSS Neural Networks 22. http://www.sussex.ac.uk/its/pdfs/SPSS_Neural_Network_22.pdf

Deep learning basics.

Song, Y., & Lu, Y. Decision tree methods: applications for classification and prediction.