My Entry to the Sportsbet/CIKM Competition


Upload: simone-romano

Post on 12-Apr-2017


Page 1: My Entry to the Sportsbet/CIKM competition

The Task My approach Conclusions

Competition for the International Conference on Information and Knowledge Management (CIKM) hosted by Sportsbet

September 16th 2015

My Entry to the Sportsbet Competition

Simone Romano

[email protected]

@ialuronico


Page 2: My Entry to the Sportsbet/CIKM competition

Outline

The Task
    Task description
    The challenges

My approach
    How to Build a Model for Predictions
    Evaluation of Prediction Error

Conclusions
    Summary
    What I would have done if I had more time


Page 3: My Entry to the Sportsbet/CIKM competition


Task description

Sportsbet competition: predict the outcome of every match in the 2015 AFL season, expressed as the probability that Team1 wins versus Team2.
E.g. Hawthorn (The Hawks) wins vs Adelaide (The Crows) on the 18th of September with probability 0.75 (75%) [1]

Two phases:

The Leaderboard Phase: prediction of the outcome of each regular-season match in the 2015 AFL season (match results are already known).

The Finals Phase: prediction of the outcome of each match in the 2015 AFL Finals Series (match results are known only after the AFL Grand Final).

I focused on the Leaderboard Phase because its match results are already known, so I could evaluate the performance of my predictions.

[1] Implied by the odds for Hawthorn on Monday the 14th of September on http://www.sportsbet.com.au/betting/australian-rules/afl
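The slides do not show how the 0.75 was read off the odds. A minimal sketch of the usual conversion, assuming decimal odds and a simple normalisation that removes the bookmaker's margin; the function name and the odds used are hypothetical and are not the actual Sportsbet figures:

```python
def implied_probabilities(odds_team1, odds_team2):
    """Convert decimal odds for two teams into probabilities that sum to 1."""
    if odds_team1 <= 1.0 or odds_team2 <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    raw1 = 1.0 / odds_team1
    raw2 = 1.0 / odds_team2
    total = raw1 + raw2  # exceeds 1 by the bookmaker's margin
    return raw1 / total, raw2 / total


# Hypothetical odds, chosen only for illustration.
p_team1, p_team2 = implied_probabilities(1.30, 3.60)
print(round(p_team1, 3), round(p_team2, 3))
```

Dividing by the total is only one of several ways to remove the margin; it is used here because it is the simplest.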


Page 5: My Entry to the Sportsbet/CIKM competition


Data provided

The following datasets were provided:

Teams: Names of the teams which took part in AFL matches between 2000 and 2015.

Players: Names of the players who played in at least one match between 2000 and 2015.

Seasons: Description, results, and statistics of regular-season (non-finals) matches. E.g. it contains:

- which team is home or away
- venue: venue of the match
- margin: winning margin

Match stats: Statistics recorded for a single player for every match (including finals) between 2000 and 2015. E.g. it contains:

- number of kicks performed
- number of goals

Finals: Information about the finals matches between 2000 and 2014.

Unplayed: Remaining (unplayed) regular-season matches in the 2015 season. (Dataset release: end of July 2015)


Page 7: My Entry to the Sportsbet/CIKM competition


The Challenges

Target: We want to predict the outcome of matches in the 2015 season using the data available.

Challenges

- Take into account the time constraints: when predicting the outcome of a match we can only use information about past matches
- Obtain low prediction error

Solution
Build an automated prediction model that incorporates information on matches played between 2000 and 2014. Given 2 teams, Team1 and Team2, the model predicts the probability for Team1 to win versus Team2.

We want our model to have low prediction error.


Page 8: My Entry to the Sportsbet/CIKM competition


Evaluation of Prediction Error

Given that we actually know the results of matches in 2015, we can compute the logloss error of our predictions. The logloss error is used to score the entries to the competition.

Useful facts about logloss error

logloss = 0: A team always wins when the model says 100% probability of winning and always loses when the model says 0%. The model generates only 100% and 0% probabilities.

logloss = LARGE: For even just one match, the model predicts a team winning with 100% probability but the team actually loses the game.

logloss = 0.693: All predictions are set to 50%.
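The three facts above can be checked with a short sketch. It assumes the standard definition of logloss for binary outcomes (the mean of -ln of the probability assigned to what actually happened); the outcomes list is hypothetical. Without any clipping, a 100% prediction that turns out wrong gives an infinite loss, so the "LARGE" value on the slide presumably reflects a scoring tool that clips probabilities, which is an assumption here.

```python
import math


def logloss(outcomes, probabilities):
    """Mean negative log-likelihood; outcomes are 1 (Team1 won) or 0 (Team1 lost)."""
    total = 0.0
    for won, p in zip(outcomes, probabilities):
        assigned = p if won == 1 else 1.0 - p  # probability given to the real result
        if assigned == 0.0:
            return math.inf  # one certain-but-wrong prediction ruins the score
        total -= math.log(assigned)
    return total / len(outcomes)


outcomes = [1, 0, 1, 1]  # hypothetical results, for illustration only

print(logloss(outcomes, [0.5, 0.5, 0.5, 0.5]))  # ln 2, roughly 0.693
print(logloss(outcomes, [1.0, 0.0, 1.0, 1.0]))  # perfect and certain: 0.0
print(logloss(outcomes, [1.0, 1.0, 1.0, 1.0]))  # one certain miss: inf
```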


Page 9: My Entry to the Sportsbet/CIKM competition


We have to keep in mind that:

- Extreme probabilities (e.g. 100% or 0%) should be avoided, because a single error can increase the logloss a lot

- Just by being conservative (predicting 50% for every match) we can obtain 0.693
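One way to act on the first point is to clip predictions away from 0 and 1 before submitting them. This is a sketch of that idea rather than something the slides say was done; the bound `eps = 0.05` and both function names are hypothetical.

```python
import math


def clip(p, eps=0.05):
    """Keep a probability inside [eps, 1 - eps]."""
    return min(max(p, eps), 1.0 - eps)


def match_loss(team1_won, p):
    """Logloss contribution of a single match with predicted probability p for Team1."""
    return -math.log(p if team1_won else 1.0 - p)


# A confident prediction for Team1 that turns out wrong.
print(match_loss(False, 0.999))        # about 6.9
print(match_loss(False, clip(0.999)))  # about 3.0

# The price of clipping when the confident prediction is right.
print(match_loss(True, 0.999))         # about 0.001
print(match_loss(True, clip(0.999)))   # about 0.05
```

The trade is asymmetric: clipping costs a little on every correct confident prediction and saves a lot on each wrong one.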

This is not an easy task, and some competitors performed really badly.


Page 11: My Entry to the Sportsbet/CIKM competition


How to Build a Model for Predictions

Position on the Leaderboard
In two days I managed to finish about halfway up the leaderboard with a logloss = 0.640: position 28 out of 52. The smallest error on the leaderboard is 0.524.


Page 12: My Entry to the Sportsbet/CIKM competition


My Approach
We can build a simple model based on matches between 2000 and 2014 and the knowledge of:

- The teams that are playing
- Which team is home and which one is away

Example: Hawthorn (The Hawks) vs Adelaide (The Crows)

Season  Round  Home team  Winner
2011    R01    Adelaide   Adelaide
2012    R03    Hawthorn   Hawthorn
2013    R06    Adelaide   Hawthorn
2014    R17    Adelaide   Hawthorn
2015    R12    Adelaide   ?

We could say that Hawthorn is going to win with probability 3/4 = 75%. Indeed, Hawthorn won.
The model learns from the results of past matches to output this probability according to this rationale.
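The rationale can be written down directly as a head-to-head win frequency. This is only a sketch of the reasoning in the example, not the Random Forest model actually used; the function name and the 0.5 fallback for teams with no shared history are assumptions.

```python
# (season, round, home team, winner) rows copied from the table above.
past_matches = [
    (2011, "R01", "Adelaide", "Adelaide"),
    (2012, "R03", "Hawthorn", "Hawthorn"),
    (2013, "R06", "Adelaide", "Hawthorn"),
    (2014, "R17", "Adelaide", "Hawthorn"),
]


def win_frequency(matches, team, before_season):
    """Fraction of head-to-head matches before `before_season` won by `team`."""
    earlier = [m for m in matches if m[0] < before_season]  # past matches only
    if not earlier:
        return 0.5  # assumption: no history means a coin flip
    wins = sum(1 for m in earlier if m[3] == team)
    return wins / len(earlier)


print(win_frequency(past_matches, "Hawthorn", before_season=2015))  # 0.75
```

Filtering on `before_season` is what enforces the time constraint: a match is never predicted from results that came after it.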


Page 13: My Entry to the Sportsbet/CIKM competition


Adding Features
Feature: measurable information about matches which we can use to predict the outcome of a match in 2015.
For example, can the “winner margin” in past games help our predictions?

Season  Round  Home team  Winner    Winner margin
2011    R01    Adelaide   Adelaide  20
2012    R03    Hawthorn   Hawthorn  56
2013    R06    Adelaide   Hawthorn  11
2014    R17    Adelaide   Hawthorn  12
2015    R12    Adelaide   ?         ?

We can only use statistics about the margin in previous matches to predict the probability of Hawthorn winning in 2015. Margins are signed from Hawthorn's point of view, so Adelaide's 20-point win counts as -20:

- Mean margin of previous matches (Hawthorn-Adelaide) ⇒ 14.75

- Maximum margin of previous matches (Hawthorn-Adelaide) ⇒ 56

- Minimum margin of previous matches (Hawthorn-Adelaide) ⇒ -20
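The three numbers follow from the table once each margin is signed from Hawthorn's point of view. A small sketch that reproduces them; the helper name is hypothetical and the rows are copied from the table:

```python
from statistics import mean

# (season, winner, winner margin) for past Hawthorn-Adelaide matches.
results = [
    (2011, "Adelaide", 20),
    (2012, "Hawthorn", 56),
    (2013, "Hawthorn", 11),
    (2014, "Hawthorn", 12),
]


def signed_margins(results, team):
    """Margins from `team`'s perspective: positive for a win, negative for a loss."""
    return [margin if winner == team else -margin for _, winner, margin in results]


margins = signed_margins(results, "Hawthorn")
print(mean(margins), max(margins), min(margins))  # 14.75 56 -20
```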

But which one is a good predictor?

Page 14: My Entry to the Sportsbet/CIKM competition


Is Mean Margin a good predictor of winning?
Distribution of games won (red) and games lost (blue) according to the Mean Margin computed on previous games, for matches 2000-2014. Mean Margin is a good predictor if these two distributions are well separated.

[Figure: histogram of Mean Margin in Previous Games (x-axis, -200 to 200) against Frequency (y-axis, 0 to 100), with separate Win and Lose series]

Insights

- If a team has a Mean Margin of more than 100 it is likely to win

- If a team has a Mean Margin of less than -90 it is likely to lose


Page 15: My Entry to the Sportsbet/CIKM competition


Min Margin as predictor of winning

[Figure: histogram of Min Margin in Previous Games (x-axis, -200 to 200) against Frequency (y-axis, 0 to 100), with separate Win and Lose series]

Insights

- If a team has been defeated in the past by as many as 150 points it is likely to lose


Page 16: My Entry to the Sportsbet/CIKM competition


Max Margin as predictor of winning

[Figure: histogram of Max Margin in Previous Games (x-axis, -200 to 200) against Frequency (y-axis, 0 to 100), with separate Win and Lose series]

Insights

- If a team has won in the past by as many as 150 points it is likely to win


Page 17: My Entry to the Sportsbet/CIKM competition


Other Features

Similarly to the margin of the final score between two teams, we can compute the margin for other statistics:

- Number of Kicks
- Number of Inside 50s
- Number of Disposals
- Number of Clearances

Rank of Attributes based on Prediction Errors (Best at the top)

Score [2]  Name
0.0449     Mean Margin Inside 50
0.0408     Mean Margin Score
0.0361     Max Margin Score
0.0325     Mean Margin Disposals

[2] According to Information Gain
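For readers unfamiliar with the score, here is a minimal sketch of Information Gain for one numeric feature split at a fixed threshold. How the original tool discretised the features is not stated, so the threshold, the function names, and the toy data below are all assumptions made for illustration.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        h -= p * math.log2(p)
    return h


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on values < threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


# Hypothetical toy data: mean margin in previous games and whether the team won.
mean_margin = [-30, -12, -5, 3, 8, 25]
team_won = [0, 0, 1, 0, 1, 1]
print(information_gain(mean_margin, team_won, threshold=0))
```

A feature that separates wins from losses perfectly scores 1 bit on balanced data; scores near 0, like those in the table, mean that each feature alone removes only a little uncertainty.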


Page 18: My Entry to the Sportsbet/CIKM competition


Evaluation of Prediction Error

I evaluated the model on the prediction of outcomes for 2015 matches:

- logloss = 0.682 without statistics (just knowing the teams that are playing)

- logloss = 0.640 with statistics

This is obtained with a black-box model (Random Forest) which is accurate but difficult to interpret.

Can we get a simpler model?

Interestingly, the simplest model obtained automatically from this data is:

(Mean-Margin ≥ -0.25 AND location = home) ⇒ win with probability 63.8%
else win with probability 36.8%

However, this shows high error: logloss = 0.689 (it does not take into account the actual teams that are playing).
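Written as code, the rule is a single branch. The sketch below copies the thresholds and probabilities from the rule as stated; the function names are hypothetical, and applying it to the Hawthorn example assumes that Mean-Margin is the head-to-head mean margin computed earlier.

```python
import math


def rule_probability(mean_margin, is_home):
    """Win probability from the one-rule model quoted above."""
    if mean_margin >= -0.25 and is_home:
        return 0.638
    return 0.368


def match_loss(team_won, p):
    """Logloss contribution of one match for predicted win probability p."""
    return -math.log(p if team_won else 1.0 - p)


# Hawthorn in 2015 R12: mean margin 14.75 against Adelaide, but playing away.
p_hawthorn = rule_probability(14.75, is_home=False)
print(p_hawthorn)                    # 0.368
print(match_loss(True, p_hawthorn))  # Hawthorn actually won this match
```

On this match the rule is penalised because it ignores which teams are playing, which is the weakness noted above.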


Page 19: My Entry to the Sportsbet/CIKM competition


Remark about Data on Previous matches

We have to be careful about taking into account matches played too long ago. Indeed, the best prediction (according to our features) is obtained with matches from 2014 only:

[Figure: Error in Prediction. x-axis: Least Recent Matches (season of the oldest matches used, 2000 to 2014); y-axis: logloss]

This is probably because the 2014 teams are very similar to the 2015 teams.

It would be interesting to see which top players moved between teams in past years.
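The experiment behind the plot can be sketched as a sweep over the oldest season allowed into the training data. The version below is a toy: it reuses the four Hawthorn-Adelaide results from the earlier example and a head-to-head frequency instead of the real features and model, and the clipping bound `eps` is hypothetical. A single match proves nothing; only the mechanics of the sweep are being shown.

```python
import math

# (season, winner) for past Hawthorn-Adelaide matches, from the earlier example.
head_to_head = [
    (2011, "Adelaide"),
    (2012, "Hawthorn"),
    (2013, "Hawthorn"),
    (2014, "Hawthorn"),
]


def window_loss(matches, least_recent_season, team, team_won, eps=0.05):
    """Logloss on one held-out match, training only on seasons >= least_recent_season."""
    train = [winner for season, winner in matches if season >= least_recent_season]
    if train:
        p = sum(1 for winner in train if winner == team) / len(train)
    else:
        p = 0.5  # assumption: no data means a coin flip
    p = min(max(p, eps), 1.0 - eps)  # hypothetical clipping keeps the log finite
    return -math.log(p if team_won else 1.0 - p)


# Hawthorn won the 2015 match, as noted in the example.
losses = {
    start: window_loss(head_to_head, start, "Hawthorn", team_won=True)
    for start in range(2011, 2015)
}
for start, loss in losses.items():
    print(start, round(loss, 3))
```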


Page 21: My Entry to the Sportsbet/CIKM competition


Summary

It is possible to predict the outcome of future matches with reasonable accuracy with 2 days of work:

- Using features obtained from the score margin and from margins based on the number of inside 50s and the number of disposals

- Combining these features using a model (Random Forest, logloss = 0.640), while still getting insights from each feature individually

- Knowing that data about recent matches is more helpful

- A small increase in error can be traded for model simplicity

Technicalities
I performed feature engineering in Python and predictions with WEKA.


Page 22: My Entry to the Sportsbet/CIKM competition


What I would have done if I had more time

There are a number of things that could improve my model, which I did not have the chance to try for lack of time:

- Predict the outcome of a match in round X of 2015 based on matches played in previous rounds of 2015

- Use many other statistics: e.g. handballs, tackles

- Use data about previously played finals

- Introduce player-level features: rank all the players based on goals and count the number of top players a team is going to field during the match

- Team strategy features (difficult to encode)

- Use odds from Sportsbet and other companies (not fair for my entry, but it would be fair in real practice)
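The player-level idea in the list above could look roughly like this. It was not part of the entry, so everything here is hypothetical: the player names, the goal counts, the cut-off `n`, and the function names.

```python
from collections import Counter


def top_scorers(player_goals, n):
    """Return the set of the n players with the most total goals."""
    totals = Counter()
    for player, goals in player_goals:
        totals[player] += goals
    return {player for player, _ in totals.most_common(n)}


def count_top_players(lineup, top_players):
    """Number of players in a match lineup who are among the top scorers."""
    return sum(1 for player in lineup if player in top_players)


# Hypothetical per-match goal records: (player, goals in that match).
stats = [
    ("Player A", 3),
    ("Player B", 1),
    ("Player A", 2),
    ("Player C", 4),
    ("Player D", 0),
]
top = top_scorers(stats, n=2)  # Player A (5 goals) and Player C (4 goals)
print(count_top_players(["Player A", "Player B", "Player D"], top))  # 1
```

To respect the time constraint, the goal totals would have to be computed only from matches played before the one being predicted.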


Page 23: My Entry to the Sportsbet/CIKM competition


Other interesting things besides predicting match outcomes

It would be interesting to analyze the data and see:

- whether there are players that are correlated with winning/losing games

- characteristics of Brownlow Medal winners

- probabilities of winning after losing the first/second/third quarters

- the ’turning points’ in important matches (which players are involved in changing the outcome of a match?)


Page 24: My Entry to the Sportsbet/CIKM competition


Thank you.

Questions?

Simone Romano

[email protected]

@ialuronico
