my entry to the sportsbet/cikm competition
The Task My approach Conclusions
Competition for the International Conference on Information and Knowledge Management (CIKM) hosted by Sportsbet
September 16th 2015
My Entry to the Sportsbet Competition
Simone Romano
@ialuronico
Outline
The Task: Task description; The challenges
My approach: How to Build a Model for Predictions; Evaluation of Prediction Error
Conclusions: Summary; What I would have done if I had more time
Task description
Sportsbet competition: predict the outcome of every match in the 2015 AFL season as the probability that Team1 wins versus Team2.
E.g. Hawthorn (The Hawks) wins vs Adelaide (The Crows) on the 18th of September with probability 0.75 (75%) [1]
Two phases:
- The Leaderboard Phase: prediction of the outcome of each regular-season match in the 2015 AFL season (match results are already known).
- The Finals Phase: prediction of the outcome of each match in the 2015 AFL Finals Series (match results are known after the AFL Grand Final).
I focused on the Leaderboard Phase because the match results are already known, so I could evaluate the performance of my predictions.
[1] Implied by the odds for Hawthorn on Monday the 14th of September on http://www.sportsbet.com.au/betting/australian-rules/afl
Data provided
The following datasets were provided:
- Teams: names of the teams which took part in AFL matches between 2000 and 2015.
- Players: names of the players that have played in at least one match between 2000 and 2015.
- Seasons: description, results, and statistics of regular-season (non-finals) matches. E.g. it contains:
    - which team is home or away
    - venue: venue of the match
    - margin: winning margin
- Match stats: statistics recorded for a single player for every match (including finals) between 2000 and 2015. E.g. it contains:
    - number of kicks performed
    - number of goals
- Finals: information about the finals matches between 2000 and 2014.
- Unplayed: remaining (unplayed) regular-season matches in the 2015 season (dataset release: end of July 2015).
The Challenges
Target: We want to predict the outcome of matches in the 2015 season using the data available.
Challenges
- Respect the time constraint: when predicting the outcome of a match we can only use information about past matches
- Obtain low prediction error
Solution
Build an automated prediction model that incorporates information on matches played between 2000 and 2014. Given 2 teams, Team1 and Team2, the model predicts the probability that Team1 wins versus Team2.
We want our model to have low prediction error.
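The time constraint can be illustrated with a minimal Python sketch. The `Match` record, its field names, and the `matches_before` helper are hypothetical and only serve the illustration; they are not the layout of the competition datasets.

```python
from typing import NamedTuple


class Match(NamedTuple):
    # Hypothetical record layout, assumed for this sketch only.
    season: int
    round_no: int
    home: str
    away: str
    winner: str


def matches_before(history, season, round_no):
    """Keep only matches played strictly before the given season and round."""
    return [m for m in history if (m.season, m.round_no) < (season, round_no)]


history = [
    Match(2013, 6, "Adelaide", "Hawthorn", "Hawthorn"),
    Match(2014, 17, "Adelaide", "Hawthorn", "Hawthorn"),
    Match(2015, 12, "Adelaide", "Hawthorn", "Hawthorn"),
]

# Features for the 2015 round 12 match may only use the first two records.
usable = matches_before(history, 2015, 12)
```

Comparing `(season, round_no)` tuples orders matches first by season and then by round, so the match being predicted is never part of its own training data.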
Evaluation of Prediction Error
Given that we actually know the results of matches in 2015, we can compute the logloss error of our predictions. The logloss error is used to score the entries to the competition.
Useful facts about logloss error
logloss = 0: the model outputs only 100% and 0% probabilities, and a team always wins when the model says 100% and always loses when the model says 0%.
logloss = LARGE: for even just one match, the model predicts that a team wins with 100% probability but the team actually loses the game.
logloss = 0.693: all predictions are set to 50%.
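These three facts can be checked with a small Python sketch of the usual binary logloss, the negative mean of y·ln(p) + (1 − y)·ln(1 − p). The clipping constant `eps` is an assumption of this sketch (without some clipping a confident wrong prediction gives an infinite loss); the competition's exact scoring code is not shown in these slides.

```python
import math


def logloss(probs, outcomes, eps=1e-15):
    """Mean binary logloss; outcomes are 1 (Team1 won) or 0 (Team1 lost)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # assumed clipping to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(probs)


outcomes = [1, 0, 1, 1]  # made-up results, for illustration only

all_fifty = logloss([0.5, 0.5, 0.5, 0.5], outcomes)       # ln(2), about 0.693
perfect = logloss([1.0, 0.0, 1.0, 1.0], outcomes)         # practically 0
one_bad_call = logloss([1.0, 1.0, 1.0, 1.0], outcomes)    # large: one 100% call was wrong
```

With four matches, a single wrong 100% prediction already pushes the mean far above the 0.693 obtained by predicting 50% everywhere.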
We have to keep in mind that:
- Extreme probabilities (e.g. 100% or 0%) should be avoided because a single error can greatly increase the logloss
- Just by being conservative (predicting 50% everywhere) we can obtain 0.693
This is not an easy task and some competitors performed really badly.
How to Build a Model for Predictions
Position on the Leaderboard
In two days I managed to finish halfway up the Leaderboard with logloss = 0.640: position 28 out of 52. The smallest error on the leaderboard is 0.524.
My Approach
We can build a simple model based on matches between 2000 and 2014 and the knowledge of:
- The teams that are playing
- Which team is home and which one is away
Example: Hawthorn (The Hawks) vs Adelaide (The Crows)
Season  Round  Team Home      Winner
2011    R01    Adelaide home  Adelaide
2012    R03    Hawthorn home  Hawthorn
2013    R06    Adelaide home  Hawthorn
2014    R17    Adelaide home  Hawthorn
2015    R12    Adelaide home  ?
We could say that Hawthorn is going to win with probability 3/4 = 75%. Indeed, Hawthorn won.
The model learns from the results of past matches to output this probability according to this rationale.
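The rationale of the Hawthorn vs Adelaide example can be written as a few lines of Python. This is only a sketch of the counting argument, not the model used for the entry; the `win_frequency` helper and the 50% fallback for pairs with no history are assumptions made here.

```python
# Winners of the four past Hawthorn-Adelaide matches (2011 to 2014) from the table.
past_winners = ["Adelaide", "Hawthorn", "Hawthorn", "Hawthorn"]


def win_frequency(team, winners):
    """Fraction of past head-to-head matches won by `team`."""
    if not winners:
        return 0.5  # assumed fallback when the two teams have no history
    return sum(w == team for w in winners) / len(winners)


p_hawthorn = win_frequency("Hawthorn", past_winners)  # 3 wins out of 4
```

A raw frequency like this can reach 0% or 100% when one team won every past match, which is exactly the kind of extreme prediction that the logloss punishes heavily.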
Adding Features
Feature: measurable information about matches which we can use to predict the outcome of a match in 2015.
For example, can “winner margin” in past games help our predictions?
Season  Round  Team Home      Winner    Winner margin
2011    R01    Adelaide home  Adelaide  20
2012    R03    Hawthorn home  Hawthorn  56
2013    R06    Adelaide home  Hawthorn  11
2014    R17    Adelaide home  Hawthorn  12
2015    R12    Adelaide home  ?         ?
We can only use statistics about the margin of previous matches to predict the probability of Hawthorn winning in 2015. Margins are taken from Hawthorn's point of view, so Adelaide's 20-point win counts as -20:
- Mean margin of previous matches (Hawthorn-Adelaide) ⇒ 14.75
- Maximum margin of previous matches (Hawthorn-Adelaide) ⇒ 56
- Minimum margin of previous matches (Hawthorn-Adelaide) ⇒ -20
But which one is a good predictor...
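The three margin statistics can be reproduced from the table with a short Python sketch. The helper names are hypothetical; the only convention assumed is that a margin is positive when the team of interest won and negative when it lost.

```python
from statistics import mean

# (winner, winner margin) for the 2011-2014 Hawthorn-Adelaide matches in the table.
past_results = [("Adelaide", 20), ("Hawthorn", 56), ("Hawthorn", 11), ("Hawthorn", 12)]


def signed_margins(team, results):
    """Margins from `team`'s point of view: positive for a win, negative for a loss."""
    return [margin if winner == team else -margin for winner, margin in results]


def margin_features(team, results):
    """Mean, maximum and minimum signed margin over the past matches."""
    margins = signed_margins(team, results)
    return {"mean": mean(margins), "max": max(margins), "min": min(margins)}


hawthorn_features = margin_features("Hawthorn", past_results)
# mean = (-20 + 56 + 11 + 12) / 4 = 14.75, max = 56, min = -20
```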
Is Mean Margin a good predictor of winning?
Distribution of games won (red) and games lost (blue) according to the Mean Margin computed on previous games, for matches 2000-2014. Mean Margin is a good predictor if these two distributions are well separated.
[Figure: histogram of Frequency (0 to 100) against Mean Margin in Previous Games (-200 to 200), with separate Lose and Win series]
Insights
- If a team has a Mean Margin of more than 100 it is likely to win
- If a team has a Mean Margin of less than -90 it is likely to lose
Min Margin as predictor of winning
[Figure: histogram of Frequency (0 to 100) against Min Margin in Previous Games (-200 to 200), with separate Lose and Win series]
Insights
- If a team has been defeated in the past by as many as 150 points it is likely to lose
Max Margin as predictor of winning
[Figure: histogram of Frequency (0 to 100) against Max Margin in Previous Games (-200 to 200), with separate Lose and Win series]
Insights
- If a team has won in the past by as many as 150 points it is likely to win
Other Features
Similarly to the margin of the final score between two teams, we can compute the margin for other statistics:
- Number of Kicks
- Number of Inside 50
- Number of Disposals
- Number of Clearances
Rank of Attributes based on Prediction Errors (Best at the top)
Score [2]  Name
0.0449     Mean Margin Inside 50
0.0408     Mean Margin Score
0.0361     Max Margin Score
0.0325     Mean Margin Disposals
[2] According to Information Gain
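For readers unfamiliar with Information Gain, the sketch below shows one common way to compute it in Python: the entropy of the win/lose labels minus the weighted entropy after splitting on a feature. Splitting a numeric feature at a single fixed threshold is a simplifying assumption of this sketch, and the toy values are made up; how the features were discretised for the ranking above is not stated in the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting `values` at `threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Made-up toy data: four matches with a mean-margin feature and their outcome.
labels = ["lose", "lose", "win", "win"]
informative = information_gain([-30, -10, 5, 40], labels, threshold=0)  # separates the classes
uninformative = information_gain([1, -1, 1, -1], labels, threshold=0)   # tells us nothing
```

A feature that separates wins from losses perfectly gets the full entropy of the labels as its gain, while a feature unrelated to the outcome gets a gain of zero; the scores in the table sit much closer to the second case.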
Evaluation of Prediction Error
I evaluated the model on the prediction of outcomes for 2015 matches:
- logloss = 0.682 without statistics (just knowing the teams that are playing)
- logloss = 0.640 with statistics
This is obtained with a black-box model (Random Forest), which is accurate but difficult to interpret.
Can we get a simpler model?
Interestingly, the simplest model obtained automatically from this data is:
(Mean-Margin ≥ -0.25 AND location = home) ⇒ win with probability 63.8%
else win with probability 36.8%
However, this shows high error: logloss = 0.689 (it does not take into account the actual teams that are playing).
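Written as Python, the simple rule above is a single condition. The function name and argument names are hypothetical; the threshold and the two probabilities are the ones quoted on the slide.

```python
def rule_probability(mean_margin, is_home):
    """Win probability from the two-branch rule quoted above."""
    if mean_margin >= -0.25 and is_home:
        return 0.638
    return 0.368


# A home team with a non-negative mean margin gets the higher probability.
p_home_favourite = rule_probability(14.75, True)
# The same mean margin for an away team falls into the "else" branch.
p_away = rule_probability(14.75, False)
```

Both outputs stay well away from 0% and 100%, which explains why the rule is safe under logloss but only slightly better than the 0.693 of a constant 50% prediction.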
Remark about Data on Previous matches
We have to be careful about taking into account matches played too long ago. Indeed, the best prediction (according to our features) is obtained using only matches from 2014:
[Figure: Error in Prediction, logloss (y-axis) against Least Recent Matches used (x-axis, 2000 to 2014)]
This is probably because 2014 teams are very similar to 2015 teams.
It would be interesting to see which top players moved between teams in the past years.
Summary
It is possible to predict the outcome of future matches with reasonable accuracy in 2 days of work:
- Using features obtained from the score margin, and from margins based on the number of inside 50 and the number of disposals
- Combining these features using a model (Random Forest, logloss = 0.640), while still getting insights from each feature individually
- Knowing that data about recent matches is more helpful
- A small increase in error can be traded for model simplicity
Technicalities
I performed feature engineering in Python and predictions with WEKA.
What I would have done if I had more time
There are a number of things that could improve my model which I did not have the chance to try because of time:
- Predict the outcome of a match in round X of 2015 based on matches played in previous rounds of 2015
- Use many other statistics: e.g. handballs, tackles
- Use data about previously played finals
- Introduce player-level features: rank all the players based on goals and count the number of top players a team is going to field during the match
- Team strategy features (difficult to encode)
- Use Sportsbet and other companies’ odds (not fair for my entry but it would be fair in real practice)
Other interesting things other than predicting match outcomes...
It would be interesting to analyze the data and see:
- whether there are players that are correlated with winning/losing games
- the characteristics of Brownlow Medal winners
- the probabilities of winning after losing the first/second/third quarter
- the 'turning points' in important matches (which players are involved in changing the outcome of a match?)
Thank you.
Questions?