
Page 1: Analysis of Unstructured Data

An Analysis of Unstructured Data Using Sklearn’s Stochastic Gradient Descent Classifier

Page 3: Analysis of Unstructured Data

Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership through rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another.

Currently, there are over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. Today, there is great interest in these systems due to the important role they play in traffic, environmental, and health issues.

Page 4: Analysis of Unstructured Data

• Apart from the interesting real-world applications of bike-sharing systems, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, the travel duration and the departure and arrival positions are explicitly recorded in these systems.

• This feature turns a bike-sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of the important events in the city could be detected by monitoring these data; a sketch of the records behind this analysis follows.
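As a rough illustration, and not part of the original slides, the snippet below loads the daily file that the classification code reads later and attaches readable column names. Those names are an assumption inferred from the target_names list in the code, since day2.csv itself is read without a header:

from pandas import read_csv

# Assumed layout: one row per day, with columns matching the feature names
# used later in the classification report (this naming is an assumption).
columns = ['season', 'year', 'month', 'holiday', 'weekday', 'workday',
           'weathersit', 'temp', 'atemp', 'huimd', 'windspeed',
           'casual', 'registered', 'count']

frame = read_csv("day2.csv", header=None, names=columns)
print(frame.head())      # a few daily records
print(frame.describe())  # summary statistics for each field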

Page 5: Analysis of Unstructured Data

__author__ = 'ExG57'

from pandas import read_csv
import numpy as np
from sklearn.linear_model.stochastic_gradient import SGDClassifier
from sklearn import preprocessing
import sklearn.metrics as metrics
from sklearn.cross_validation import train_test_split

d = read_csv("day2.csv", header=None)
d = d.values

target = d[:,2]
train = d[:,0:]

xTrain, xTest, yTrain, yTest = train_test_split(train, target, test_size=0.25, random_state=0)
changeScale = preprocessing.StandardScaler()
xTrain = changeScale.fit_transform(xTrain)
xTest = changeScale.transform(xTest)

classify = SGDClassifier(loss='log')
classify.fit(xTrain, yTrain)

yPredict = classify.predict(xTest)

#classifier quality
print("The accuracy score for daily data is: ")
print(metrics.accuracy_score(yTest, yPredict))
print("The classification report is as follows: ")
print(metrics.classification_report(yTest, yPredict, target_names=['season', 'weathersit', 'month', 'holiday', 'weekday', 'workday', 'year', 'temp', 'atemp', 'huimd', 'windspeed', 'casual', 'registered', 'count']))
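The imports above reflect the scikit-learn API available in 2015: sklearn.cross_validation was later replaced by sklearn.model_selection, SGDClassifier is now imported from sklearn.linear_model directly, and the logistic loss is spelled 'log_loss' in current releases. A minimal sketch of the same daily pipeline against the newer API, with the same file name and parameters assumed and the classification report omitted:

from pandas import read_csv
from sklearn.linear_model import SGDClassifier          # public import path
from sklearn import preprocessing
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split    # replaces sklearn.cross_validation

d = read_csv("day2.csv", header=None).values
target = d[:, 2]
train = d[:, 0:]

xTrain, xTest, yTrain, yTest = train_test_split(train, target, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply the same scaling to the test split.
changeScale = preprocessing.StandardScaler()
xTrain = changeScale.fit_transform(xTrain)
xTest = changeScale.transform(xTest)

classify = SGDClassifier(loss='log_loss')  # 'log' was renamed to 'log_loss' in newer releases
classify.fit(xTrain, yTrain)
yPredict = classify.predict(xTest)

print(metrics.accuracy_score(yTest, yPredict))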

Page 6: Analysis of Unstructured Data

#confusion matrix
print("Confusion Matrix: ")
print(metrics.confusion_matrix(yTest, yPredict))

d = read_csv("hour2.csv", header=None)
d = d.values

target = d[:,2]
train = d[:,0:]

xTrain, xTest, yTrain, yTest = train_test_split(train, target, test_size=0.25, random_state=0)
changeScale = preprocessing.StandardScaler()
xTrain = changeScale.fit_transform(xTrain)
xTest = changeScale.transform(xTest)

classify = SGDClassifier(loss='log')
classify.fit(xTrain, yTrain)

yPredict = classify.predict(xTest)

#classification
print("The accuracy score for hourly data is: ")
print(metrics.accuracy_score(yTest, yPredict))
print("The classification report is as follows: ")
print(metrics.classification_report(yTest, yPredict, target_names=['season', 'year', 'month', 'holiday', 'weekday', 'workday', 'weathersit', 'temp', 'atemp', 'huimd', 'windspeed', 'casual', 'registered', 'count']))

#matrix
print("Confusion Matrix: ")
print(metrics.confusion_matrix(yTest, yPredict))

#m.e.o.w.
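The daily and hourly runs above repeat the same load-split-scale-fit-report steps. One possible refactoring, a sketch that is not part of the original slides, wraps those steps in a helper function so each dataset is handled by a single call (the original 2015-era import paths are kept):

from pandas import read_csv
from sklearn.linear_model.stochastic_gradient import SGDClassifier
from sklearn import preprocessing
import sklearn.metrics as metrics
from sklearn.cross_validation import train_test_split

def run_sgd(path, label):
    # Hypothetical helper mirroring the steps used for day2.csv and hour2.csv above:
    # load the file, split, scale, fit an SGD classifier, and print the metrics.
    d = read_csv(path, header=None).values
    target = d[:, 2]
    train = d[:, 0:]
    xTrain, xTest, yTrain, yTest = train_test_split(train, target, test_size=0.25, random_state=0)
    scaler = preprocessing.StandardScaler()
    xTrain = scaler.fit_transform(xTrain)
    xTest = scaler.transform(xTest)
    classify = SGDClassifier(loss='log')
    classify.fit(xTrain, yTrain)
    yPredict = classify.predict(xTest)
    print("The accuracy score for %s data is: " % label)
    print(metrics.accuracy_score(yTest, yPredict))
    print("Confusion Matrix: ")
    print(metrics.confusion_matrix(yTest, yPredict))

run_sgd("day2.csv", "daily")
run_sgd("hour2.csv", "hourly")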

Page 7: Analysis of Unstructured Data

The accuracy score for daily data is: 0.459016393443

The classification report is as follows:

             precision    recall  f1-score   support

     season       0.69      0.56      0.62        16
       year       0.41      0.41      0.41        17
      month       0.14      0.27      0.19        11
    holiday       0.45      0.38      0.42        13
    weekday       0.35      0.43      0.39        14
    workday       0.33      0.15      0.21        13
       year       0.64      0.80      0.71        20
       temp       1.00      0.12      0.22        16
      atemp       0.39      0.60      0.47        20
      huimd       0.33      0.50      0.40        16
  windspeed       0.67      0.31      0.42        13
     casual       1.00      0.71      0.83        14

avg / total       0.54      0.46      0.46       183

Confusion Matrix:
[[ 9  7  0  0  0  0  0  0  0  0  0  0]
 [ 4  7  6  0  0  0  0  0  0  0  0  0]
 [ 0  3  3  3  2  0  0  0  0  0  0  0]
 [ 0  0  6  5  2  0  0  0  0  0  0  0]
 [ 0  0  4  2  6  1  1  0  0  0  0  0]
 [ 0  0  1  1  7  2  2  0  0  0  0  0]
 [ 0  0  0  0  0  0 16  0  4  0  0  0]
 [ 0  0  0  0  0  3  5  2  6  0  0  0]
 [ 0  0  1  0  0  0  1  0 12  6  0  0]
 [ 0  0  0  0  0  0  0  0  6  8  2  0]
 [ 0  0  0  0  0  0  0  0  0  9  4  0]
 [ 0  0  0  0  0  0  0  0  3  1  0 10]]

The accuracy score for hourly data is: 0.565477560414

The classification report is as follows:

             precision    recall  f1-score   support

     season       1.00      0.95      0.97       351
       year       0.64      0.44      0.52       315
      month       0.28      0.46      0.35       378
    holiday       0.35      0.52      0.41       355
    weekday       0.49      0.36      0.41       357
    workday       0.57      0.27      0.36       355
    weather       0.66      0.74      0.70       363
       temp       0.49      0.31      0.38       386
      atemp       0.34      0.40      0.37       347
      huimd       0.56      0.90      0.69       397
  windspeed       0.92      0.40      0.55       359
     casual       1.00      1.00      1.00       382

avg / total       0.61      0.57      0.56      4345

Confusion Matrix:
[[332  18   0   1   0   0   0   0   0   0   0   0]
 [  1 140 161  12   1   0   0   0   0   0   0   0]
 [  0  61 173 114  25   3   2   0   0   0   0   0]
 [  0   0 149 184  21   1   0   0   0   0   0   0]
 [  0   0  92  77 127  49  12   0   0   0   0   0]
 [  0   0  32  56  82  95  86   3   1   0   0   0]
 [  0   0   0  58   1   2 270  24   8   0   0   0]
 [  0   0   0  12   0  12  30 119 213   0   0   0]
 [  0   0   2  12   0   4   9  98 138  84   0   0]
 [  0   0   0   7   0   0   0   0  22 356  12   0]
 [  0   0   0   0   0   0   0   0  23 194 142   0]
 [  0   0   0   0   0   0   0   0   1   0   0 381]]
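As a quick consistency check, not part of the original slides, both reported accuracies can be recovered from the confusion matrices alone: the diagonal of the daily matrix sums to 84 correct predictions out of 183 test samples (84/183 ≈ 0.459), and the diagonal of the hourly matrix sums to 2457 out of 4345 (2457/4345 ≈ 0.565), matching the printed scores. A small helper that performs this check on any confusion matrix:

import numpy as np

def accuracy_from_confusion(cm):
    # accuracy = correctly classified samples (the diagonal) / all samples
    cm = np.asarray(cm)
    return np.trace(cm) / float(cm.sum())

# e.g. accuracy_from_confusion(metrics.confusion_matrix(yTest, yPredict))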