pm2 project basketball--final edit near

11/29/2016 PM2 Project BasketballFINAL EDIT NEAR


Auburn Basketball

Here I try to predict the results of Auburn's basketball schedule for the coming 2016-2017 season. To check the accuracy of the model, I will compare its predictions for the previous 2015-2016 season with the actual results.

I'll use an MLP neural network classifier. My inputs will be the past matches (replacing each team name with a lot of stats for both teams), and my output will be a number indicating the result (0 = tie, 1 = team1 wins, 2 = team2 wins). I'll be using pybrain for the classifier, pandas to hack my way through the data, and pygal for the graphs (far easier than matplotlib). A lot of extra useful things are implemented in the utils.py file, mostly to abstract the data processing I need before I feed the classifier.

In [1]:
%matplotlib inline

import pandas as pd
from IPython.display import SVG

from utils import get_team_stats, extract_samples, normalize

In [2]:
# Used to avoid including tied matches. I found this greatly improves the accuracy.
# In basketball there are no ties. If a game is tied at the end of the standard game time,
# then the game goes into overtime.
# Excluding ties ensures that there are no data entry errors since ties should not exist.
exclude_ties = True

# used to duplicate matches data, reversing the teams (team1->team2, and vice versa).
# This helps on visualizations, and also improves precision of the predictions avoiding a dependence on the
# order of the teams from the input.
duplicate_with_reversed = True

RAW_MATCHES_FILE = 'auburn_basketball_database_new1.csv'
RAW_WINNERS_FILE = 'raw_winners.csv'
TEAM_RENAMES_FILE = 'team_renames.csv'


def show(graph):
    '''Small utility to display pygal graphs'''
    return SVG(graph.render())



In [3]:
def get_matches(with_team_stats=False, duplicate_with_reversed=False,
                exclude_ties=False):
    """Create a dataframe with matches info."""
    # read_csv with index_col=0 stands in for DataFrame.from_csv,
    # which newer pandas releases no longer provide.
    matches = pd.read_csv(RAW_MATCHES_FILE, index_col=0)

    #for column in ('team1', 'team2'):
        #matches[column] = apply_renames(matches[column])

    if duplicate_with_reversed:
        # Append a mirrored copy of every game with the teams and scores swapped.
        id_offset = len(matches)
        matches2 = matches.copy()
        matches2.rename(columns={'team1': 'team2',
                                 'team2': 'team1',
                                 'score1': 'score2',
                                 'score2': 'score1'},
                        inplace=True)
        matches2.index = matches2.index.map(lambda x: x + id_offset)
        matches = pd.concat((matches, matches2))

    def winner_from_score_diff(x):
        if x > 0:
            return 1
        elif x < 0:
            return 2
        else:
            return 0

    matches['score_diff'] = matches['score1'] - matches['score2']
    matches['winner'] = matches['score_diff']
    matches['winner'] = matches['winner'].map(winner_from_score_diff)

    if exclude_ties:
        matches = matches[matches['winner'] != 0]

    if with_team_stats:
        stats = get_team_stats(matches)
        matches = matches.join(stats, on='team1')\
                         .join(stats, on='team2', rsuffix='_2')

    return matches
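To make the mirroring and labelling steps in get_matches easier to follow, here is a minimal sketch on a toy frame. The team names and scores are made up for illustration, and the frame is built in memory instead of being read from the CSV.

```python
import pandas as pd

# Hypothetical two-game sample; team names and scores are invented for illustration.
games = pd.DataFrame(
    {"team1": ["AAA", "BBB"], "team2": ["CCC", "DDD"],
     "score1": [70, 61], "score2": [65, 80]},
    index=pd.Index([1, 2], name="Game_ID"),
)

# Mirror every game: swap the team and score columns, then shift the ids
# so the mirrored rows do not collide with the originals.
mirrored = games.rename(columns={"team1": "team2", "team2": "team1",
                                 "score1": "score2", "score2": "score1"})
mirrored.index = mirrored.index + len(games)
both = pd.concat([games, mirrored], sort=False)

both["score_diff"] = both["score1"] - both["score2"]
# Same coding as winner_from_score_diff: 1 = team1 won, 2 = team2 won, 0 = tie.
both["winner"] = both["score_diff"].map(lambda d: 1 if d > 0 else 2 if d < 0 else 0)

print(both[["team1", "team2", "score_diff", "winner"]])
```

Each original game shows up twice, once from each team's point of view, so a team1 win in one row is a team2 win in its mirrored row.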

In [4]:
def apply_renames(column):
    """Apply team renames to a team column from a dataframe."""
    with open(TEAM_RENAMES_FILE) as renames_file:
        renames = dict(l.strip().split(',')
                       for l in renames_file.readlines()
                       if l.strip())

    def renamer(team):
        return renames.get(team, team)

    return column.map(renamer)
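The parsing in apply_renames implies a file with one "old_name,new_name" pair per line, with blank lines skipped. A small sketch of that parsing, fed from a list instead of the real team_renames.csv (whose contents are not shown here, so the pairs below are assumptions):

```python
# Hypothetical file contents: one "old_name,new_name" pair per line.
sample_lines = ["Auburn,AUB\n", "\n", "Kentucky,UK\n"]

# Same parsing expression as in apply_renames; blank lines are dropped.
renames = dict(line.strip().split(",") for line in sample_lines if line.strip())


def renamer(team):
    # Names without an entry pass through unchanged.
    return renames.get(team, team)


print([renamer(t) for t in ["Auburn", "Kentucky", "LSU"]])
```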



In [5]: matches = get_matches(with_team_stats=True, duplicate_with_reversed=duplicate_with_reversed, exclude_ties=exclude_ties)

In [14]: #Some descriptive statistics

In [6]: print(matches.head())

               date  score1  score2  team1  team2  winner  year  score_diff  \
Game_ID
1        11/18/2016      44      69   WSSU    EKY       2  2008         -25
2        12/10/2016      99     106    UWS    FAU       2  2009          -7
3        11/26/2016      44      62    SMU    TCU       2  2008         -18
4        11/13/2016      70      99    FIU   MONM       2  2009         -29
5        11/15/2016      79      99   WSSU    UCD       2  2009         -20

         matches_played  matches_won  years_played  matches_won_percent  \
Game_ID
1                 108.0         28.0           4.0            25.925926
2                  12.0          0.0           5.0             0.000000
3                 386.0        190.0           7.0            49.222798
4                 374.0        144.0           7.0            38.502674
5                 108.0         28.0           4.0            25.925926

         matches_played_2  matches_won_2  years_played_2  \
Game_ID
1                   386.0          234.0             7.0
2                   378.0          152.0             7.0
3                   380.0          150.0             7.0
4                   374.0          124.0             7.0
5                   374.0          130.0             7.0

         matches_won_percent_2
Game_ID
1                    60.621762
2                    40.211640
3                    39.473684
4                    33.155080
5                    34.759358

In [7]: team_stats = get_team_stats(matches)

In [8]: print(team_stats.head())

      matches_played  matches_won  years_played  matches_won_percent
team
ROC              2.0          0.0           1.0             0.000000
SIU            376.0        154.0           7.0            40.957447
PRIN           346.0        228.0           7.0            65.895954
OSU            436.0        340.0           7.0            77.981651
HEND             8.0          0.0           4.0             0.000000
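get_team_stats lives in utils.py, which is not shown, so the following is only a consistency check on the printed rows and not its implementation. It assumes matches_won_percent is 100 * matches_won / matches_played and tests that assumption against the five teams above.

```python
# (matches_played, matches_won, matches_won_percent) copied from the output above.
rows = {
    "ROC": (2.0, 0.0, 0.000000),
    "SIU": (376.0, 154.0, 40.957447),
    "PRIN": (346.0, 228.0, 65.895954),
    "OSU": (436.0, 340.0, 77.981651),
    "HEND": (8.0, 0.0, 0.000000),
}

# Assumed relation: percentage of played matches that were won.
recomputed = {team: 100.0 * won / played
              for team, (played, won, _) in rows.items()}

for team in rows:
    print(team, round(recomputed[team], 6), rows[team][2])
```

The recomputed values agree with the printed column to six decimal places for these rows.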



In [9]:
### Split the data set for regression, Bernoulli, SVC
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split

input_features = ['year', 'matches_won_percent', 'matches_won_percent_2']
output_feature = ['winner']

inputs, outputs = extract_samples(matches, input_features, output_feature)
normalizer, inputs = normalize(inputs)

X_train, X_test, y_train, y_test = train_test_split(matches[input_features],
                                                    matches[output_feature],
                                                    test_size = 0.2,
                                                    random_state=12)

prediction = dict()

from sklearn.naive_bayes import MultinomialNB
modelM = MultinomialNB().fit(X_train, y_train)
prediction['Multinomial'] = modelM.predict(X_test)

from sklearn.naive_bayes import BernoulliNB
modelN = BernoulliNB().fit(X_train, y_train)
prediction['Bernoulli'] = modelN.predict(X_test)

from sklearn import linear_model
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
prediction['Logistic'] = logreg.predict(X_test)

from sklearn.svm import SVC
svc = SVC(C= 1.0, kernel='linear')
svc.fit(X_train, y_train)
prediction['SVC'] = svc.predict(X_test)



C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py:515: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py:515: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py:515: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\svm\base.py:514: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y_ = column_or_1d(y, warn=True)
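The DataConversionWarning above comes from passing a one-column DataFrame as the target. A minimal sketch of the shape difference, using a hypothetical label frame in place of matches[output_feature]:

```python
import pandas as pd

# Hypothetical labels shaped like matches[['winner']]: a one-column DataFrame.
y_frame = pd.DataFrame({"winner": [1, 2, 2, 1]})
print(y_frame.shape)  # two-dimensional: (n_samples, 1)

# Either of these yields the 1-D shape (n_samples,) that the warning asks for.
y_flat = y_frame.values.ravel()   # NumPy array
y_series = y_frame["winner"]      # pandas Series
print(y_flat.shape, y_series.shape)
```

Passing something like y_flat to fit would be one way to silence the warning; the fitted models should be the same either way, since scikit-learn performs this conversion itself when it warns.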

In [10]: y_test["winner"] = y_test["winner"] - 1

In [11]:
# Convert 1 to 0, 2 to 1
def formatt(x):
    x = x - 1
    return x

vfunc = np.vectorize(formatt)
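A short sketch of what vfunc does when applied to an array of labels coded 1/2. The label values below are hypothetical, and the comparison with plain array arithmetic is added only to show the two are equivalent.

```python
import numpy as np


def formatt(x):
    return x - 1


vfunc = np.vectorize(formatt)

labels = np.array([1, 2, 2, 1])  # hypothetical labels coded 1/2
shifted = vfunc(labels)          # element-wise: 1 -> 0, 2 -> 1

# Plain array arithmetic gives the same result without np.vectorize.
same_as_subtraction = bool((shifted == labels - 1).all())
print(shifted, same_as_subtraction)
```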



In [12]:
cmp = 0
colors = ['b', 'g', 'y', 'm', 'k']
for model, predicted in prediction.items():
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predicted)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    plt.plot(false_positive_rate, true_positive_rate, colors[cmp],
             label='%s: AUC %0.2f' % (model, roc_auc))
    cmp += 1

plt.title('Classifiers comparison with ROC')
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
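Because predict returns hard class labels and not scores, each ROC curve here has a single interior point, and its area reduces to (1 + TPR - FPR) / 2. A pure-Python sketch of that arithmetic, with a hypothetical helper name and made-up labels coded 0/1:

```python
def hard_label_auc(y_true, y_pred):
    """Area under the ROC curve through (0, 0), (FPR, TPR), (1, 1) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tpr = float(tp) / pos
    fpr = float(fp) / neg
    # The two trapezoids under the curve simplify to (1 + TPR - FPR) / 2.
    return (1 + tpr - fpr) / 2


# Hypothetical labels, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
example_auc = hard_label_auc(y_true, y_pred)
constant_auc = hard_label_auc(y_true, [1] * len(y_true))
print(example_auc, constant_auc)
```

A model that always predicts the same class lands on the diagonal with an AUC of 0.5, which is worth keeping in mind when reading the plot.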



In [13]:
# select only Auburn games and generate predictions
matches_Auburn = matches[matches['team1'] == 'AUB']

# generate predictions
matches_Auburn['Logistic'] = logreg.predict(matches_Auburn[input_features])
matches_Auburn['Multinomial'] = modelM.predict(matches_Auburn[input_features])
matches_Auburn['Bernoulli'] = modelN.predict(matches_Auburn[input_features])
matches_Auburn['SVC'] = svc.predict(matches_Auburn[input_features])

# print the results
columnlist = ['year', 'team1', 'team2', 'winner', 'Logistic', 'Multinomial', 'Bernoulli', 'SVC']
print(matches_Auburn[columnlist].head(20))



C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\Users\dsm0014\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
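The SettingWithCopyWarning above appears because columns are added to a filtered slice of matches. One common way to avoid it is to take an explicit copy of the slice first. A minimal sketch with a hypothetical stand-in frame and placeholder prediction values:

```python
import pandas as pd

# Hypothetical stand-in for the matches frame.
matches_demo = pd.DataFrame({"team1": ["AUB", "UK", "AUB"],
                             "year": [2008, 2008, 2009]})

# An explicit copy makes the subset an independent frame, so adding
# columns to it does not trigger SettingWithCopyWarning.
auburn_demo = matches_demo[matches_demo["team1"] == "AUB"].copy()
auburn_demo["prediction"] = [2, 2]  # placeholder values, not model output

print(auburn_demo)
```

The original frame is left untouched; only the copy gains the new column.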

         year  team1  team2  winner  Logistic  Multinomial  Bernoulli  SVC
Game_ID
911      2008    AUB    DAY       2         2            2          2    2
1202     2008    AUB    XAV       2         2            2          2    2
1847     2008    AUB    UVA       1         2            2          2    2
2784     2009    AUB   SCAR       2         2            2          2    2
3305     2009    AUB     UK       2         2            2          2    2
3385     2009    AUB    ARK       1         2            2          2    2
3967     2009    AUB   MISS       2         2            2          2    2
4653     2009    AUB    UGA       1         2            2          2    2
4841     2009    AUB    LSU       2         2            2          2    2
5141     2009    AUB   MSST       1         2            2          2    2
5286     2009    AUB    ALA       1         2            2          2    2
5687     2009    AUB   TENN       2         2            2          2    2
6038     2009    AUB   MOSU       2         2            2          2    2
6397     2009    AUB   NCST       2         2            2          2    2
6858     2009    AUB   AAMU       1         1            1          2    1
7480     2009    AUB    FSU       2         2            2          2    2
8666     2010    AUB   TENN       2         2            2          2    2
8976     2010    AUB    LSU       1         2            2          2    2
9079     2010    AUB    VAN       2         2            2          2    2
9749     2010    AUB    ARK       2         2            2          2    2
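To put a number on the table above, here is a small sketch that counts how often each model matches the winner column. It covers only the 20 rows printed, so it says nothing about the rest of the Auburn games, and it is not the accuracy on the held-out test set.

```python
# (winner, Logistic, Multinomial, Bernoulli, SVC) for the 20 rows printed above.
rows = [
    (2, 2, 2, 2, 2), (2, 2, 2, 2, 2), (1, 2, 2, 2, 2), (2, 2, 2, 2, 2),
    (2, 2, 2, 2, 2), (1, 2, 2, 2, 2), (2, 2, 2, 2, 2), (1, 2, 2, 2, 2),
    (2, 2, 2, 2, 2), (1, 2, 2, 2, 2), (1, 2, 2, 2, 2), (2, 2, 2, 2, 2),
    (2, 2, 2, 2, 2), (2, 2, 2, 2, 2), (1, 1, 1, 2, 1), (2, 2, 2, 2, 2),
    (2, 2, 2, 2, 2), (1, 2, 2, 2, 2), (2, 2, 2, 2, 2), (2, 2, 2, 2, 2),
]
models = ["Logistic", "Multinomial", "Bernoulli", "SVC"]

# Count the rows where each model's prediction equals the actual winner.
correct = {name: sum(1 for row in rows if row[i + 1] == row[0])
           for i, name in enumerate(models)}

for name in models:
    print("%s: %d of %d correct" % (name, correct[name], len(rows)))
```

On these rows the models predict an Auburn loss (2) almost everywhere, so most of the agreement comes from the 13 games Auburn actually lost.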

