r is for racing · – licensing situation is expensive (largely b2b) and restrictive – no...

26
R is For Racing Colin Magee January 2019

Upload: others

Post on 19-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

RisForRacing

ColinMageeJanuary2019

Page 2: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor
Page 3: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Overview

• Introduction,contextandhistory• Dataforprogrammatichorseracinganalysis• Exploringhorseracingdata• Buildingmodelswithhistorichorseracingdata• Usingdailydatatofindprofitablepredictions• UsingabettortoplacebetsintoBetfair• Whatnext?Avoidinggambler’sruin!

© 2019

Page 4: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

About• Horseracingenthusiast,punter,sometimeowner

– CurrentlymemberofHorseracingBettors’Forumhttp://ukhbf.org/about/ supportedbytheBHA

• Careerwise,execinsoftwareanddatabusinesses– CurrentlyonboardatDoorda www.doorda.com

• Setupwww.betwise.co.uk in2007– InitiallytosupportbookAutomaticExchangeBetting– firstbookonautomatingbettingprocessviaBetfair API

– CreatedSmartform forhorseracing/dataenthusiasts– CurrentlyworkingonRisforRacingwithJayEmersonhttp://www.stat.yale.edu/~jay/

© 2019

Page 5: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Context• Horseracingisinmanywaysthearchetypalpredictiveanalyticsproblem– Constantseriesofexperiments(races)– Bookmakersandpuntershavealwaysusedinformationtoassignprobabilitiesofsuccesstoeachcontender

• Nosurpriseeitherthatthehistoryofbettingonhorseracingiscloselyintertwinedwiththerootsofcomputerscience,programmingandstatistics…

© 2019

Page 6: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Babbage– designedfirstcomputer?

© 2019

Page 7: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Lovelace– firstprogrammer?

© 2019

Page 8: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Butdidyouknow….?Fromthe1840s,Adaturnedherprodigioustalentstowardgamblingandprogrammingtheoutcomesofhorseraces.Amysterious“book”waspassedbetweenLovelaceandBabbageonceaweekthatprobablycontainedaprogramdesignedtopredicthorse-raceresults.

OnMay17,1850,Adawrotetohermother:“Iamafraidyouwilltakenointerestinwhatinterestsmemuchjustnow,—viz:thewinneroftheDerby.”Markus,Julia.LadyByronandHerDaughters

Apparently,Adalost£3,200*bettingontheDerby.

*c.£425,000in2019

© 2019

Page 9: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Ihavenoideawhattheywereupto

• Therewasverylittledataasweunderstandit–possiblytheyweretryingtocalculatetheeffectsofweightinslowingdownhorsesofdifferentagesoverdifferentdistances,butwhoknows?

• Whatistrueisthatinthiscenturyonemightassumeweareinamuchbetterpositiontomasterthisdomainbyapplyingaclassicdatascienceapproach– withmoderncomputingtools,domainknowledge,usefuldataandmodeling expertise

© 2019

Page 10: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

…ModernOpportunitiesandChallenges

• Bettingexchanges– givenAPIaccess- indeedprovideopportunitytobequantitativeinbettinganalysisandexecution– WroteAutomaticExchangeBettingin2007aboutthis

• Morechallengingwasthe’fundamentaldata’situationtounderstandthedomain,createmodelsandunderstandwhattobeton– Licensingsituationisexpensive(largelyB2B)andrestrictive– Noprogrammabledatasource– lotsofwebsitesorGUIdrivendatabases

reliedonwebscraping ormanualprocessesfordailydataextraction

• So… forconveniencewecreatedSmartform,anenduserlicensed,programmabledatabaseforhorseracingforenthusiastsanddatascientistsalike

© 2019

Page 11: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Smartform

• https://www.betwise.co.uk/smartform– MySQLdatabase– 15yearsofhorseracinghistoryinUKandIreland,updateddailywithresultsandupcomingraces

– Easytoautomate,importandconnectfrommostanylanguage

• Risanaturalchoiceforanalysisenvironment– Minimizeinteration withMySQLandgetstraighttodataexploration,analysis,modelingetc

– AndpackagesavailableforBetfair APIetc.

© 2019

Page 12: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

#Connecttodatabase,easywithRMySQLcon<- dbConnect(MySQL(),

host='127.0.0.1',user=’name',dbname='smartform')

#Databaseissimple,flatstructure– non-normalised– makingiteasytoslurpintoR>dbListTables(con)[1]"daily_races" "daily_runners"[3]"historic_races" "historic_runners”#.…Plussomeextratablesforbetfair historicpricesandeasymappingofentities(morelater)….

#Fordataanalysisandmodeling weonlyreallycareabouthistorictables– race_id iscommonkeyquery<- paste("select*fromhistoric_races;",sep="");#Allracesx<- as.data.frame(dbGetQuery(con,query),stringsAsFactors =FALSE)dim(x)191867 40query2<- paste("select*fromhistoric_runners;",sep="") #Allrunnersinthoseracesy<- as.data.frame(dbGetQuery(con,query2),stringsAsFactors =FALSE)dim(y)2140306 66

© 2019

Page 13: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Exploringhorseracingdata• So,wehaveabout2millionhorses(c.162kareunique)spreadamongst200,000races,withresultsforeachrace

• Despitelookofdatabeing“flat”,thetime/historyelementofracingandinterdependententities(eg.races,horses,sires,dams,trainers,jockeys,owners,coursesetc)enablesustocreatethousandsofderiveddimensionsfordeepermodelling

• Butdatacanstillbemessy:Splitbyjurisdiction,splitwithinjurisdiction,fromdifferentsources:Vigilanceisrequired!

• Let’slookatsomepreprocessing /explorationetc. (Demo)

© 2019

Page 14: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

© 2019

Page 15: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Featureengineeringoverview

• Basiccleanuptasks– removingnon-runnersetc.• Wecandoalotmorewithderivingfieldsfromexistingfields(eg.regexonrace_name forraces)

• Second,usefulbinning(eg.distance,going,season)• Lastbutnotleast,anynumberofvariablesbasedonrepresentingentityhistory(upto- butnotincluding-eachrunningdate)inonerow• Horses,Sires,Trainers,Jockeys,Owners,StallNumber• Dimensionsondimensions– eg.Trainer/Jockeycombos

– Dataprepcanbeacomputationalnightmare

© 2019

Page 16: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Modelswithhorseracingdata

• Inallcases,predictingawinnerisnotactuallythepoint,orevendoingitaccurately

• Wearesimplytryingtobeatthemarket

© 2019

Page 17: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Acoupleofdemomodel approaches

• Pickoneofourfavouritefeaturesandlookforprofitableangles(oldfashionedpuntersapproach)

• Amorecomprehensivestatisticalmodeltrainedonpastdataandappliedtonewdata

© 2019

Page 18: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Model#1– findingaprofitableangle

• Sometrainershavegoodstrikerates(#winners/#losers),somenotsomuch

• Allhavedifferingspecialities,butthemarkettendstofocusonaveragetrainerstrike(ifusingitatall)

• WecanusedataanalysisinRtofindniches(usingthenewfeatureswecreatedforseasonalperformance)wheretrainersoutperformtheiraverageform

• Next,andmoreimportant,wecanassesswhetherthemarketgenerallyfactorsintheirchances– ornot

• Automatingthisprocessthrowsupmanycandidates(script)

© 2019

Page 19: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Bettingonthemodel

• Thisissimpleenoughtobegoodforshowingofftheliveselectionprocessandusingabettor• https://github.com/phillc73/abettor/blob/master/vignettes/abettor-placeBet.Rmd

• ImportallracesandrunnersfromdailyracesanddailyrunnersinSmartform

• Findallhorsesrunningtodaybymergingagainstourtrainerhotlist

• Usetheconveniencedailybetfair tablessuppliedinSmartform tolookupbetfair references

• Loopthrougheachselection,andinvokeabettortobetatcurrentmarketprices

© 2019

Page 20: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Model#2– arealmodel!

• Manyapproachestopredictingoutcome• Binomial(winnerorbust)• Multinomial(win,show,place,lose)• Continuousregression(transformfinishpositiontoapointalonganinterval)

• Modeltypes:Rhasafew- takeyourpick!• Withthismodel,Jayappliesclassicregression,basedoncontinuousresponsevariable

© 2019

Page 21: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Aboutthe(Linear)Model

• Response:atransformationofthefinishingposition(1,2,…,kforak-horserace)totheunitinterval[0,1].

• Predictors(examples):– Thehorse’spastwinpercentage– Thejockey’swinpercentageinthelastmonthandoverthelastyear(i.e.isthejockey“inform”)

– Thetrainer’swinpercentageinthelastmonth…– Thestallposition,or“effectofthedraw”– …

© 2019

Page 22: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Disclaimers(fromJay)!

• Weviolatetheindependenceassumption• Weviolatethe“normalerrors”assumption• … but…• Inapracticalsense,itdoesn’tmattermuch--Yes,alittlesecretofreal-worlddataanalysis!

• Wecouldspendalotoftimeworryingaboutthisandtryingtoimproveit… andthepayoffwouldbenegligible.

© 2019

Page 23: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

NotesonPrediction(fromJay)

• Predictingwinprobabilitiesdoneviasimulation.Why?– Becausethemodelpredictedresponseislikeanunobserved“strengthofperformance”(latentvariable)… whilewe’reinterestedinrankings:Whowins?Whocomesinsecond,etc…

– … andofcoursethereisahugeamountofvariability(the“errors”)inreal-worldproblemslikethisone!

– Andit’sreallyeasytopulloffwithoutmakingseriousmistakes.

© 2019

Page 24: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Predictions!?January15,2019:UpcomingKemptonRace – 7:45pm

© 2019

NamePredictedFinish

PredictedProb(Win)

PredictedProb(2nd)

PredictedProb(3rd)

PredictedDec Odds

SixStrings 1 21.9 17.6 14.8 4.6

MrGent 2 17.4 15.7 14.4 5.7

InTheRed 3 16.7 15.3 14.3 6

StormMelody 4 11.6 12.3 12.7 8.6

MountWellington 5 10.7 11.9 12.4 9.4

PearlSpectre 6 9.5 11 12 10.5

Wilson 7 6.7 8.6 10.2 14.9

BlueCandy 8 5.4 7.6 9.2 18.4

Page 25: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Whatnext?Gambler’sRuin• Inbetting,mostwagershaveanoveralllosingexpectation–

evenifwearebeatingthemarketperse• Aplayerwithfiniteresourceswillinevitablygobrokeunless

thereiscareful:– Riskmanagement– Bankmanagement

• Theobjectiveistooptimise thesizeofthestakerelativetothewinningexpectationinthebetandthesizeofthebank.

• What’sneeded?Copiousbacktesting andastakingstrategy• PlentyofstrategiessuchasKellyCriterion,andpackages

fromourfriendsinRfinance thatcanhelp,eg– PerformanceAnalytics

© 2019

Page 26: R is For Racing · – Licensing situation is expensive (largely B2B) and restrictive – No programmable data source –lots of websites or GUI driven databases relied on webscrapingor

Thanks!

• Lookoutfor‘RisforRacing’– sometimein2019!

– Blog: https://blog.betwise.net/– Q&A: https://answers.betwise.net/– Contact: https://www.betwise.co.uk/contact

© 2019