machine learning with statistical imputation for ......approval dataset includes more than 6,000...
TRANSCRIPT
HDSR
Issue1
Machine learning with statistical imputation forpredictingdrugapprovals‡AndrewW.Lo1,2,3*,KienWeiSiah1,2,ChiHeemWong1,21LaboratoryforFinancialEngineering,SloanSchoolofManagement,MassachusettsInstituteofTechnology,Cambridge,Massachusetts,UnitedStatesofAmerica2EECSandCSAIL,MassachusettsInstituteofTechnology,Cambridge,Massachusetts,UnitedStatesofAmerica3SantaFeInstitute,SantaFe,NewMexico,UnitedStatesofAmerica*Correspondingauthor
email:[email protected](A.W.L.)
AbstractWeapplymachine-learningtechniquestopredictdrugapprovalsusingdrug-developmentandclinical-trialdatafrom2003to2015involvingseveralthousanddrug-indicationpairswith over 140 features across 15 disease groups. To deal with missing data, we useimputationmethodsthatallowustofullyexploittheentiredataset,thelargestofitskind.We show that our approach outperforms complete-case analysis, which typically yieldsbiasedinferences.Weachievepredictivemeasuresof0.78,and0.81AUC(“areaunderthereceiveroperatingcharacteristiccurve,”theestimatedprobabilitythataclassifierwillrankapositiveoutcomehigherthananegativeoutcome)forpredictingtransitionsfromphase2to approval and phase 3 to approval, respectively. Using five-year rolling windows, wedocumentan increasingtrend inthepredictivepowerof thesemodels,aconsequenceofimprovingdataqualityandquantity.Themostimportantfeaturesforpredictingsuccessaretrialoutcomes,trialstatus,trialaccrualrates,duration,priorapprovalforanotherindication,andsponsortrackrecords.Weprovideestimatesoftheprobabilityofsuccessforalldrugsinthecurrentpipeline.
Keywords:Drugdevelopment;Clinicaltrials;Healthcare;Machinelearning;Imputation
‡WethankInformaforprovidingusaccesstotheirdataandexpertiseandareparticularlygratefultoChristineBlazynski,MarkGordon,andMichaelHayformanyhelpfulcommentsanddiscussionthroughoutthisproject.WealsothankthemandAnnaBarker,LindaBlackerby,LaraBoro,SteveFinch,HowardFingert,Xiao-LiMeng,EllenMoore,JimReddoch,LaraSullivan,MartyTenenbaum,JamesWade,andthreeanonymousrefereesforspecificcommentsonthismanuscript,andJaynaCummingsandPaigeSammartinoforeditorialassistance.ResearchsupportfromtheMITLaboratoryforFinancialEngineeringisgratefullyacknowledged.Theviewsandopinionsexpressedinthisarticlearethoseoftheauthorsonly,anddonotnecessarilyrepresenttheviewsand opinions of any institution or agency, any of their affiliates or employees, or any of the individualsacknowledgedabove.
HDSR
Issue1Author contributions: Conceptualization, data-license agreement, A.W.L.; Methodology,A.W.L.,K.W.S.andC.H.W.;Investigation,K.W.S.;Writing(firstdraft),K.W.S.;Writing(review,editing,finaldraft),K.W.S.andC.H.W.;Visualization,K.W.S.;Projectsupervision,A.W.L.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page1of44©2019byLo,Siah,andWong,AllRightsReserved
1 IntroductionWhilemanyrecentmedicalbreakthroughssuchasimmuno-therapies,genetherapies,and
gene-editing techniques offer new hope for patients, they have also made biomedical
innovationriskier,andmorecomplexandexpensive.Thesebreakthroughsgeneratenovel
therapiesforinvestigation,eachofwhichrequiresmanyyearsoftranslationalresearchand
clinicaltesting,costinghundredsofmillionstobillionsofdollarsandyetoftenfaceahigh
likelihoodoffailure(Fernandez,Stein,&Lo,2012).Infact,drugdevelopmentproductivity—
theratioofthenumberofnewdrugsapprovedtoR&Dspendingeachyear—hasdeclined
steadilyoverthepast50yearsdespitescientificandtechnicalprogress.Thisphenomenon,
whichScannell,Blanckley,Boldon, andWarrington (2012) termed “Eroom’sLaw,” as the
reverse of Moore’s Law, suggests that the cost of developing new drugs has doubled
approximatelyeverynineyearssincethe1950s. In the faceofmultipleuncertainties, the
needtoevaluatedrugcandidatesbetterandallocatecapitaltohigh-potentialopportunities
moreefficientlyhasonlyintensified.
Toaddresstheseneeds,inthisarticleweapplymachine-learningtechniquestopredictthe
outcomesofrandomizedclinicaltrials.Machinelearningisaninterdisciplinaryfieldfocused
on tackling pattern recognition problems and building predictive models tomake data-
driven decisions, which is well-suited for this context. Successful applications of these
techniqueshavealreadyrevolutionizedanumberofindustries(e.g.,advertising,marketing,
financeandinsurance,oilandgasexploration)andarepoisedforevengreaterimpactvia
autonomousvehicles,facial-recognitionauthentication,andgeneral-purposerobotics.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page2of44©2019byLo,Siah,andWong,AllRightsReserved
Drugdevelopershavealreadyappliedmachine-learningtoolstothediscoveryprocessvia
high-throughputscreeningofvastlibrariesofchemicalandbiologicalcompoundstoidentify
drug targets. However, in managing their portfolios of investigational drugs, biopharma
companies typically use unconditional estimates of regulatory approval rates based on
historically observed relative frequencies. Machine learning techniques yield conditional
estimates of success, conditioned on a host of predictive factors known to affect the
likelihood of approval, including drug compound characteristics, clinical trial design,
previoustrialoutcomes,andthesponsortrackrecord.Weshowthatthesefeaturescontain
usefulsignalsaboutdrugdevelopmentoutcomesthatwillallowustoforecasttheoutcome
ofpipelinedevelopmentsmoreaccurately.
Ourmethodologyandresultshaveseveralimplicationsforstakeholdersinthebiomedical
ecosystem.Moreaccurateforecastsofthelikelihoodofsuccessofclinicaltrialswillreduce
theuncertaintysurroundingdrugdevelopment,whichwillincreasetheamountofcapital
thatinvestorsanddrugdevelopersarewillingtoallocatetothisendeavor.Byextension,this
would lower the cost of capital and increase the efficiency of the allocation process.
Specifically, we predict the probability of success of drug candidates in two scenarios:
(1)advancing from phase 2 to regulatory approval and (2) from phase 3 to regulatory
approval(seeFig1).Investorsanddrugdevelopersmayusesuchpredictionstoevaluatethe
risksofdifferentinvestigationaldrugsatdifferentclinicalstages,providingthemwithmuch-
needed transparency. Greater risk transparency is one source of improved financial
efficiencybecauseitfacilitatesmoreaccuratematchingofinvestorriskpreferenceswiththe
risksofbiomedicalinvestmentopportunities.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page3of44©2019byLo,Siah,andWong,AllRightsReserved
Machine-learningmodels can also offer guidance to scientists, clinicians, and biopharma
professionalsas towhich factorsaremost important indeterminingclinical-trialsuccess,
suggesting ways to improve the drug development process and decelerate or reverse
Eroom’sLaw.
Policymakers and regulators would also benefit from machine-learning predictions,
particularlyfordrug-indicationpairsthatarepredictedtofailwithhighlikelihood—these
cases highlight themostdifficult challenges in biomedicine and underscore the need for
greatergovernmentandphilanthropicsupport.
Fig1.Predictivemodelsforassessingtheprobabilityofapprovalofdrugcandidatesintwoscenarios:(1)afterphase2testing,and(2)afterphase3testing.
Tothebestofourknowledge,ourstudyisthelargestofitskind.Weconstructtwodatasets,
one for each scenario, from two proprietary pharmaceutical pipeline databases,
Pharmaprojects and Trialtrove provided by Informa® (Informa, 2016). The phase-2-to-
approvaldatasetincludesmorethan6,000uniquedrugsfor288indicationsandover14,500
phase2trials,andthephase-3-to-approvaldatasetcontainsmorethan1,400uniquedrugs
for253indicationsandover4,500phase3trials.Thesedatacoverover15indicationgroups.
Incontrast,mostpublishedresearchondrugapprovalpredictionhaveverysmallsample
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page4of44©2019byLo,Siah,andWong,AllRightsReserved
sizes,areconcentratedonspecifictherapeuticareas,andinvolveonlyoneorasmallnumber
of predictive factors: Malik et al. (2014) examined the trial objective responses of 88
anticanceragentsinphase1;Goffin,Baral,Tu,Nomikos,andSeymour(2005)studiedthe
tumorresponseratesof58cytotoxicagentsin100phase1trialsand46agentsin499phase
2trials;El-MaraghiandEisenhauer(2008)lookedattheobjectiveresponsesof19phase2
anticancerdrugs in89singleagent trials; Jardim,Groves,Breitfeld, andKurzrock (2017)
examinedtheresponseratesof80phase3oncologydrugstoidentifyfactorsassociatedwith
failures;andDiMasietal.(2015)analyzed62cancerdrugsandproposedanapprovednew
drug index (ANDI)algorithmwith four factors topredict approval for lead indications in
oncology after phase 2 testing (see SupplementaryMaterials H for a comparison of our
analysistotheirs).
Anotherkeydifferenceinourapproachisthatwedealwithmissingdatausingstatistical
imputation methods. We explore four common approaches to “missingness” and
demonstratetheiradvantagesanddisadvantagesoverdiscardingincompletecases.Withthe
FDA Amendments Act of 2007, drug and clinical trial data collection has been rapidly
expanding,butthesedataareoftensparse,andourdatasetisnoexception.Relatedstudies
(e.g.,DiMasietal.,2015)havetypicallyusedonlycomplete-caseobservations—discarding
clinicaltrialswithanymissinginformation—whichtypicallyeliminateslargeportionsofthe
dataandmayalsoleadtocertainbiases.
Weusemachine-learningtechniquestoformourpredictions,includingcross-validationfor
trainingandaheld-outtestingsetforperformanceevaluation,andusethestandard“area
under the receiver operating characteristic curve” (AUC) metric to measure model
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page5of44©2019byLo,Siah,andWong,AllRightsReserved
performance(AUCistheestimatedprobabilitythataclassifierwillrankapositiveoutcome
higherthananegativeoutcome[Fawcett,2006]).WeachieveAUCsof0.78 forpredicting
phase2toapproval(95%confidenceinterval(CI):[0.75,0.81])and0.81forpredictingphase
3toapproval(95%CI:[0.78,0.83]).Atime-series,walk-forwardanalysisapproachshows
similarresults.Wealsoapplyourmodelstothecurrentdrugpipeline—thatis,alldrugsstill
indevelopmentasoftheendofourdataset—toidentifythecandidatesthathavethehighest
and lowestprobabilitiesofsuccess.Weexaminethe latestdevelopmentstatusesof these
pipelinedrug-indicationpairs—atrue“out-of-sample”experiment(validationondatanot
used inmodel building)—and find that candidateswith higher scores are, indeed,more
likelytoprogresstolaterclinicalstages.Thisindicatesthatourclassifiersdodiscriminate
betweenhigh-andlow-potentialcandidates.
2 MaterialsandmethodsDataThecommercialdatavendorInforma®offerstwodatabasesthatareusedinouranalysis:
Pharmaprojects,whichspecializesindruginformation,andTrialtrove,whichspecializesin
clinical trials information (Informa,2016).These twodatabasesaggregatedrugand trial
informationfromover30,000datasourcesinmorethan150countries,includingcompany
press releases, government drug databases (e.g., Drugs@FDA) and trial databases (e.g.,
Clinicaltrials.gov [Zarin, Tse, Williams, & Carr, 2016], Clinicaltrialsregister.eu [extracted
from EudraCT]), and scientific conferences and publications. Using these sources, we
constructtwodatasetsofdrug-indicationpairs:phase2toapproval(P2APP)andphase3to
approval(P3APP).WeextractclinicaltrialfeaturesfromTrialtrove,andaugmentthisdata
usingdrug features fromPharmaprojects.Applyingmachine-learningalgorithms to these
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page6of44©2019byLo,Siah,andWong,AllRightsReserved
datasetsallowsustoestimate:(1)whetheradrug-indicationpairthathasconcludedphase
2testingwillbeapprovedeventually;and(2)whetherapair thathasconcludedphase3
testingwillbeapprovedeventually.DatacleaningproceduresareoutlinedinSupplementary
MaterialsA.
Weconsiderallindicationsassociatedwithaparticulardrug,asopposedtoonlythelead
indication.Weextract all features that could conceivably correlatewith the likelihoodof
success, from drug compound attributes (31 features from Pharmaprojects profiles) to
clinical trial characteristics (113 features fromTrialtrove). These features are defined in
Table1andSupplementaryMaterialsA.Ingeneral,eachdatasetmaybepartitionedintotwo
disjointsubsets:onewithsamplesthathaveknownoutcomes,andanotherwithsamples
thatarestillinthepipelineatthetimeofsnapshotofthedatabases(thatis,theoutcomes
areunknown).Toprovideintuitionforthecharacteristicsofthesamples,wedescribekey
summarystatisticsofeachsubset.
TheP2APPdatasetconsistsof6,344drug-indicationpairsthathaveendedphase2testing;
thatis,therearenophase2trialsinprogressorplannedinthedatabase.Thephase2trials
inthisdatasetrangefromAugust8,1990toDecember15,2015.Inoursample,4,812pairs
haveknownoutcomes,while1,532pairsarestillinthepipeline.Inthesubsetwithknown
outcomes, we define the development statuses of suspension, termination, and lack of
developmentas“failures”(86.8%),andregistrationandlaunchas“successes”orapprovals
(13.2%).TheP3APPdatasetconsistsof1,870pairsthathaveendedphase3testing,ofwhich
1,610pairshaveknownoutcomes,while260pairsarestillinthepipeline.Forthosepairs
withknownoutcomes,wedefine“failures”(59.1%)and“successes”(40.9%) inthesame
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page7of44©2019byLo,Siah,andWong,AllRightsReserved
fashionas theP2APPdataset.Thephase3 trials inP3APPspan from January1,1988 to
November1,2015.ThesefiguresaresummarizedinTable2.Here,theuseofterms“success”
and“failure”isinthecontextofachievingapproval.Wenotethatourdefinitionof“failures”
canincludedrugdevelopmentprogramsthatareterminatedduetofactorsunrelatedtothe
performanceofthedrug(e.g.,marketconditions,businessdecisions).InSection3Results,
wefindthatthisoutcomevariablehassignificantassociationswithtrialperformanceand
otherfactors.
Thedatasetscover15indicationgroups:alimentary,anti-infective,anti-parasitic,bloodand
clotting, cardiovascular, dermatological, genitourinary, hormonal, immunological,
musculoskeletal,neurological,anti-cancer,rarediseases,respiratory,andsensoryproducts.
Anti-canceragentsmakeupthelargestsubgroupinP2APP,andthesecondlargestinP3APP
(seeTable3).Industry-sponsoredtrialsdominatebothdatasets(seeTable4).Inaggregate,
weobserveadecreasingtrendinsuccessratesoverfive-yearrollingwindowsfrom2003to
2015(seeFig2).
To the best of our knowledge, this sample is the largest of its kind. All prior published
researchinthisliteratureinvolvedfewerthan100drugsor500trials(DiMasietal.,2015;
El-Maraghi & Eisenhauer, 2008; Goffin et al., 2005; Malik et al., 2014). In addition, our
datasetscoveradiversesetofindicationgroups,asopposedtoasingleareasuchasoncology.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page8of44©2019byLo,Siah,andWong,AllRightsReserved
Table1.DescriptionofparentfeaturesextractedfromPharmaprojectsandTrialtrove.Someparentfeaturesaremulti-label(e.g.,atrialmaybetaggedwithUnitedStatesandUnitedKingdomsimultaneously).Wetransformallmulti-labelparentfeaturesintobinarychildfeatures(1or0).SeeSupplementaryMaterialsAforspecificexamplesofeachfeature.Notethatdrug-indicationpairsforthesamedrughavethesamedrugfeatures;drug-indicationpairsinvolvedinthesametrialhavethesametrialfeatures.
Description TypeDrugFeaturesRoute Routeofadministrationofthedrug,thepathbywhichthedrugistaken
intothebody.Multi-label
Origin Originoftheactiveingredientinthedrug. Multi-labelMedium Mediumofthedrug. Multi-labelBiologicaltargetfamily Familyofproteinsinthebodywhoseactivityismodifiedbythedrug,
resultinginaspecificeffect.Multi-label
Pharmacologicaltargetfamily
Mechanismofactionofthedrug,thebiochemicalinteractionthroughwhichthedrugproducesitspharmacologicaleffect.
Multi-label
Drug-indicationdevelopmentstatus
Currentphaseofdevelopmentofthedrugfortheindication. Binary
Priorapprovalofdrugforanotherindication
Approvalofthedrugforanotherindicationpriortotheindicationunderconsideration(specifictodrug-indicationpair).
Binary
TrialFeaturesDuration Durationofthetrial(fromreportedstartdatetoenddate)indays. ContinuousStudydesign Designofthetrial(keywords). Multi-labelSponsortype Sponsorsofthetrialgroupedbytypes. Multi-labelTherapeuticarea Therapeuticareastargetedbythetrial. Multi-labelTrialstatus Statusofthetrial. BinaryTrialoutcome Resultsofthetrial. Multi-labelTargetaccrual Targetpatientaccrualofthetrial. ContinuousActualaccrual Actualpatientaccrualofthetrial. ContinuousLocations Locationsofthetrialbycountry. Multi-labelNumberofidentifiedsites Numberofsiteswherethetrialwasconducted. ContinuousBiomarkerinvolvement Typeofbiomarkerinvolvementinthetrial. Multi-labelSponsortrackrecord Sponsor’ssuccessindevelopingotherdrugspriortothedrug-indication
pairunderconsideration.Continuous
Investigatorexperience Primaryinvestigator’ssuccessindevelopingotherdrugspriortothedrug-indicationpairunderconsideration.
Continuous
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page9of44©2019byLo,Siah,andWong,AllRightsReserved
Table2.SamplesizesofP2APPandP3APPdatasets.Weconsiderphase2trialinformationinP2APPdatasetsandphase3trialinformationinP3APPdataset.
Counts Drug-indication
PairsPhase2/3Trials
UniqueDrugs
UniqueIndications
UniquePhase2/3Trials
P2APPSuccess 635 2,563 540 173 2,486Failure 4,177 10,328 2,779 263 9,722Pipeline 1,532 2,815 1,189 221 2,713Total 6,344 15,706 4,073 288 14,584
P3APPSuccess 659 1,830 572 171 1,801Failure 951 2,425 764 203 2,360Pipeline 260 494 240 120 480Total 1,870 4,749 1,451 253 4,552
Table 3. Breakdown of drug-indication pairs by indication groups. A drug-indication pairmay havemultipleindicationgrouptags.Forinstance,renalcanceristaggedasbothanti-cancerandrarediseaseinPharmaprojects.
Count P2APP P3APPAll 6,344 1,870Anti-cancer 2,239 409RareDiseases 1,105 259Neurological 1,069 444Alimentary 757 249Immunological 474 101Anti-infective 493 177Respiratory 428 134Musculoskeletal 394 121Cardiovascular 388 158Dermatological 254 45Genitourinary 210 85BloodandClotting 160 97Sensory 137 41Hormonal 17 4Anti-parasitic 8 0
Table4.Breakdownoftrialsbysponsortypes.Atrialmaybesponsoredbymorethanoneparty(e.g.,collaborationbetweenindustrydevelopersandacademia).
Counts P2APP P3APPAll 14,584 4,552OtherPharma 5,432 1,721Top20Pharma 5,322 2,369Academic 4,869 736Government 1,807 314CooperativeGroup 958 230NotforProfit 181 51Generic 52 54ContractResearchOrganization 41 17
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page10of44©2019byLo,Siah,andWong,AllRightsReserved
Fig2.SuccessratesinP2APPandP3APPoverfive-yearrollingwindowsfrom2003-2015.
MissingdataPriortothe2007FDAAmendmentsAct(FDAAA),itwasnotuncommonforinvestigatorsto
release only partial information about pipeline drugs and clinical trials to protect trade
secretsorsimplybecausetherewasnoincentivetodomore.Eventoday,someinvestigators
still do not adhere to the FDAAA-mandated registration policy or submit adequate
registrations.Therefore,allhistoricaldrugdevelopmentdatabaseshavemissingdata.We
notethatthe“missingness”hereislargelyrelatedtothepost-studyreportingofclinicaltrial
dataasopposedtoin-trialdatamissingness(e.g.,censorshipofpaneldataduetopatients
terminatingtrialparticipationprematurely).Intheformercase,thedata(e.g.,trialduration,
trialoutcomes)areusuallyavailabletotheinvestigatorsbutmaynotbereleasedpublicly,
and are thus considered “missing” from our standpoint. Therefore, our dataset may be
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page11of44©2019byLo,Siah,andWong,AllRightsReserved
consideredanapproximationof theactualdataavailable to theFDAand individualdrug
developers.
Fig3,Fig4,Table5,andTable6summarizethepatternsofmissingnessinourdataset(we
excludepipelinedrug-indicationpairsherebecausetheiroutcomesarestillpending).The
missingdatapatternsaremultivariate.Whenconditionedonthelatestlevelofdevelopment,
foranyindication,wefindthatsuccessfuldrugsgenerallyhavelowerlevelsofmissingness
comparedtofaileddrugs.Forinstance,intheP2APPdataset,61%offaileddrugshavean
unknownmedium,while only 15%of approved drugs aremissing this feature.We also
observe that completed trials tend tohavegreater levelsofmissingness than terminated
trials.Betweentwodatasets,wefindthattheP3APPdataset,whichfocusesonphase3drugs
andtrials,generallyhaslessmissingdataforbothdrugandtrialfeaturesthantheP2APP
datasetwhichfocusesonphase2drugsandtrials.Thisisexpectedsincephase3trialsare
primarilyusedtosupportregistrationfilings.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page12of44©2019byLo,Siah,andWong,AllRightsReserved
Fig3.Missingnesspatternsofdrugfeatures.Eachrowcorrespondstoauniquedrug.Featuresnotincludedinthefigurearecompleteanddonothavemissingvalues.Abbreviations:Devstatus:highestlevelofdevelopmentofadrugforanyindication;Pharma:pharmacologicaltargetfamily;Bio:biologicaltargetfamily.
Fig4.Missingnesspatternsoftrialfeatures.Eachrowcorrespondstoauniqueclinicaltrial.Featuresnotincludedinthefigurearecompleteanddonothavemissingvalues.Abbreviations:Devstatus:highestlevelofdevelopmentof a drug for any indication; Status: trial status; Idsi: numberof identified sites; Actacc: actual accrual;Dura:duration;Taracc:targetaccrual;Loc:locations;Dkw:trialstudydesignkeywords;Outcome:trialoutcomes.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page13of44©2019byLo,Siah,andWong,AllRightsReserved
Table5.Missingnessindrugfeatureswithrespecttouniquedrugs(seeFig3).Thecolumnheading“Unconditional”referstooverallmissingnesswithoutconditioningonoutcome.
Missingness Unconditional Success Failure
P2APPRoute 0.04 0.00 0.04Pharmacologicaltargetfamily 0.06 0.02 0.07Biologicaltargetfamily 0.32 0.27 0.32Medium 0.53 0.15 0.61
P3APPRoute 0.01 0.00 0.02Pharmacologicaltargetfamily 0.03 0.02 0.04Biologicaltargetfamily 0.27 0.24 0.30Medium 0.35 0.14 0.54
Table6.Missingnessintrialfeatureswithrespecttouniquetrials(seeFig4).Thecolumnheading“Unconditional”referstooverallmissingnesswithoutconditioningontrialstatus.
Missingness Unconditional Completion Termination
P2APPNumberofidentifiedsites 0.10 0.10 0.10Actualaccrual 0.12 0.10 0.22Duration 0.26 0.29 0.05Targetaccrual 0.37 0.42 0.09Locations 0.02 0.02 0.02Studydesignkeywords 0.22 0.24 0.10Trialoutcomes 0.63 0.73 0.11
P3APPNumberofidentifiedsites 0.10 0.09 0.12Actualaccrual 0.12 0.09 0.26Duration 0.17 0.19 0.06Targetaccrual 0.27 0.31 0.09Locations 0.01 0.01 0.02Studydesignkeywords 0.09 0.09 0.06Trialoutcomes 0.53 0.62 0.07
Mostrelatedstudiesdonotreporttheextentofmissingdataintheirsamples,presumably
becausesmallerdatasetswereused.DiMasietal.(2015)reportedmissingdataforsomeof
theirfactors,andaddresseditthroughlistwisedeletion—deletingallobservationswithany
missingfactors.Sincestatisticalestimatorsoftenrequirecompletedata,thisapproachisthe
simplestremedyformissingness.However,itgreatlyreducestheamountofdataavailable
anddecreasesthestatisticalpoweroftheresultingstatistics.Furthermore,listwisedeletion
isvalidonlyunderstrictandunrealisticassumptions(seebelow),andwhensuchconditions
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page14of44©2019byLo,Siah,andWong,AllRightsReserved
areviolated,inferencesarebiased.Inthecurrentstudy,wemakeanefforttoincludeinour
analysis all observed examples, with or without complete features, through the use of
statisticalimputation.
Missingdatamaybeclassified intothreecategories(Rubin,1976):missingcompletelyat
random (MCAR), missing at random (MAR), and missing not-at-random (MNAR). MCAR
refers todatathataremissing forreasonsentirely independentof thedata;MARapplies
whenthemissingnesscanbefullyaccountedforbytheobservedvariables;andMNARrefers
tosituationswhenneitherMCARnorMARisappropriate,inwhichcasetheprobabilityof
missingnessisdependentonthevalueofanunobservedvariable(VanBuuren,2012).See
SupplementaryMaterialsBfortheprecisedefinitionsofeachtypeofmissingness.
IfthemissingnessisMCAR,theobservedsamplescanbeviewedasarandomsubsampleof
the dataset. Consequently, using listwise deletion should not introduce any bias. While
convenient, this assumption is rarely satisfied in practice. In most drug-development
databases,faileddrugsaremorelikelytohavemissingfeaturesthansuccessfuldrugs(see
Table5).Clearly,MCARdoesnothold.
Applyinglistwisedeletionwhenthemissingness isnotMCARcan leadtoseverelybiased
estimates.Moreover,giventhenatureofdrug-developmentreporting,alargeportionofthe
originaldatamaybediscardedifmanyvariableshavemissingvalues.Forthesereasons,the
listwise-deletionapproachadoptedbyDiMasietal.(2015)andothersislessthanideal.
Givenonlytheobserveddata,itisimpossibletotestforMARversusMNAR(Enders,2010).
However, our knowledge of the data-collection process suggests thatMAR is a plausible
startingpoint,andwehypothesizethatthemissingnessindrugandtrialfeaturesismainly
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page15of44©2019byLo,Siah,andWong,AllRightsReserved
accountedforbydrugdevelopmentandtrialstatusesrespectively.OurobservationsinTable
5andTable6supportthisapproach,asthemissingnessproportionsforsomefeaturesdiffer
greatlydependingontheoutcome.
OurassumptionofMARisconsistentwiththedata-collectionmethodologyintheInforma®
databases.DrugprofilesarebuiltupovertimeinPharmaprojects.Asadrugadvancestolater
phases, information about its characteristics becomes more readily available because
investigators releasemore data about pipeline drugs after each phase of clinical testing.
Informa®inputsthisinformationintoitsdatabasesastheybecomeavailableinthepublic
domainorthroughprimaryresearch.Approveddrugsaremorelikelytohavemorecomplete
profiles, while information about failed drugs tends to stay stagnant because no further
studiesareconducted.ItisveryplausiblethattheMARnatureofourdatasetsisanartifact
ofdatacollection,andbyextension,soaresimilarpharmaceuticaldatasetsextractedfrom
thepublicdomainandmaintainedinthesamefashion.Originallyintendedtotrackdrugand
trialactivities,PharmaprojectsandTrialtrovearenotstructuredtokeeptrackofinformation
updatesovertimesincetherewasnouseforit.Withouttimestampsoftheupdates,weare
notabletoeliminatetheMARartifactfromourdatasets.
Inouranalysis,weimputethemissingdataunderthemoreplausibleMARassumptionto
obtain complete datasets. In contrast to listwise deletion,we fill inmissing values using
informationintheobservedvariables.Thisallowsustoutilizedatathatwouldotherwisebe
discarded.Thereafter,wecanapplyalltheusualstatisticalestimatorsto this imputation-
completeddata.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page16of44©2019byLo,Siah,andWong,AllRightsReserved
MethodsOur analysis consists of two parts thatwe perform in R version 3.2.3. First, we impute
missingvaluestogeneratecompletedatasets.Next,weapplyarangeofmachine-learning
algorithmstobuildpredictivemodelsbasedontheimputeddata.Illustrationofthespecific
componentsofouranalysisappearinFig5.
Fig5.Modelingmethodologyadoptedinthisstudy.Abbreviations:CV:cross-validation.
Weformulateourtwoscenariosassupervisedbipartiterankingproblems,wherethegoalis
topredict theoutcome—successor failure—ofadrug-indicationpairgivenasetof input
features.Initially,wespliteachdatasetintotrainingandtestingsets.Foreachscenario,we
trainvariousclassifiersbasedonthecorrespondingtrainingset,andcomputetheexpected
errorofourpredictivemodelsbytestingthemontheheld-outtestingset.
Wecreatefeaturematricesfromthedatasetsbyrepresentingdrugandtrialfeaturesforeach
drug-indicationpairasvectors(seeFig6).Drug-indicationpairsassociatedwithmultiple
trialsarerepresentedbythesamenumberoffeaturevectors,e.g.,apairwithtwotrialshas
tworows.WegiveaconcreteexampleinFig6.Considerthedrug-indicationpairAnaliptin-
diabetestype2intheP2APPdataset.Werepresentitusingtwovectorrowssinceithastwo
phase2trialsinTrialtrove.Notethatthefeaturematrixisincompleteduetomissingdrug
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page17of44©2019byLo,Siah,andWong,AllRightsReserved
andtrialfeatures.Wealsoconstructacolumnvectoroflabels,whichcontainstheoutcomes
of the drug-indication pairs. Labels are not available for pipeline drug-indication pairs
because theyare still indevelopmentand theiroutcomesare stilluncertain,hence these
observationsarenotusedtotrainourclassifiers.However,withthetrainedclassifiers,we
cangeneratepredictionsforpipelinedata.
Wespliteachdataset(excludingpipelinedrugs-indicationpairs)intotwodisjointsets,one
trainingsetandonetestingset,andformfeaturematricesforbothaccordingtothedrug-
indicationpairsineachset.Thetestingsetsserveasout-of-sampledatasetstoevaluateour
models.Therefore,wemasktheiroutcomes(thatis,wetreatthemasunknown)andaccess
themonlyattheveryendtocheckourperformance.
Fig6.Featurematrixofdataset.Eachrowcorrespondstoafeaturevector;eachfeaturecorrespondstoanentryinthevector;eachvectorhasalengthof144sincewehave31drugand113trialfeatures.Featurevectorsofalldrug-indicationpairsinthedatasetformthefeaturematrixcollectively.TrialIDisauniquetrialidentifierinTrialtrove.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page18of44©2019byLo,Siah,andWong,AllRightsReserved
Todealwithmissingdatainbothtrainingandtestingsets,weconsiderlistwisedeletionand
four statistical imputation techniques commonly used in social science research and
biostatistics: unconditional mean imputation, k-nearest neighbor (kNN) imputation,
multipleimputation(MI),anddecision-treealgorithms(seeSupplementaryMaterialsCfor
details). We follow best practices of the missing-data literature by including as many
relevantauxiliaryvariablesaspossible,aswellasallvariablesusedinsubsequentmodels
(Collins,Schafer,&Kam,2001;Enders,2010;Rubin,1996;Schafer&Graham,2002).This
makestheassumptionofMARmoreplausibleinourdatasets,andhelpstoreducebiasin
subsequent analyses (Schafer, 1997). In particular, it is necessary to include our target
variable—the drug-indication development status—inour imputationmodel becausewe
hypothesize thatmissingness ismainly accounted for by it. This is not an issue for the
trainingsets.However,theoutcomesinthetestingsetsaremasked,andnotsupposedtobe
known.Therefore,wetreatthetestingsetoutcomesasthoughtheyweremissingandimpute
themtogetherwithalltheothermissingfeatures.Afterimputation,wediscardtheimputed
testing-setoutcomes,anduseonlythe imputedfeaturevalues forpredictions.Wedothe
samewhenevaluatingpipelinedatasets.
Withrespect to themachine-learningalgorithm,weexploreseveral linearandnon-linear
classifiers commonly used in this literature: penalized logistic regression (PLR), random
forests(RF),neuralnetworks(NN),gradientboostingtrees(GBT),supportvectormachines
with radial basis functions (SVM), and decision trees C5.0. We implement the first five
algorithmsusingthescikit-learnpackageinPython(Pedregosaetal.,2011)andthesixth
usingtheC50packageinR(Kuhn,Weston,Coulter,&Quinlan,2014).Fortraining,weweight
each featurematrix rowexampleaccording to thenumberof trialsof the corresponding
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page19of44©2019byLo,Siah,andWong,AllRightsReserved
drug-indicationpair.Inourearlierexample,thedrug-indicationpairAnaliptin-diabetestype
2was involved in twophase2 trials. It is representedby twovector rows in the feature
matrix(seeFig6).Bothrowsareusedastrainingexamples,andeachisweightedequally
duringtraining(0.5,sincetherearetwotrials in total).Toobtainpredictions foradrug-
indicationpair,weaveragetheoutputprobabilitiesandscoresofthecorrespondingfeature
vectorrowsthatareusedasinputstotheclassifier.
All machine-learning algorithms have hyper-parameters that affect the flexibility of the
modelandmustbetunedtoeachdatasettooptimizegoodnessoffit.Poorly-chosenhyper-
parameterscanleadtooverfitting(attributingsignaltonoise)orunderfitting(attributing
noise to signal).We tuneourparametersusingk-fold cross-validation (withk=5or10,
dependingonthesamplesize).Sincethecross-validationprocessshouldemulatethetesting
processascloselyaspossible,weincludeimputationinthecross-validationloopaswell.We
splitthetrainingsetintovalidationandnon-validationfolds.Thenwetreatvalidationfold
outcomesasmissing,andimputethemaswewouldforatestingset.Fromhere,weignore
theimputedvalidationfoldoutcomesandproceedwiththestandardvalidationprocess.
Inthefinalstep,wetestthetrainedclassifiersontheunseentestingsetsforout-of-sample
modelvalidation.Thisgivestheexpectedperformanceofourpredictivemodelsforeachof
thescenarios,usingthestandardAUCmetrictomeasuremodelperformance.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page20of44©2019byLo,Siah,andWong,AllRightsReserved
3 ResultsSimulationoflistwisedeletionversusimputationWe study the effects of imputation using a “gold-standard” dataset derived from the
completecasesoftheP2APPdataset(seeTable7).Tosimulatethemissingnesspresentin
theoriginaldataset,we introducemissingness in thegold-standarddatasetbasedonour
MARassumptionandthemissingnesspatternsobservedintheP2APPdataset.Werandomly
splitthedrug-indicationpairsintoatrainingset(70%)andatestingset(30%),andusefive
differentmissingdataapproaches,asdescribedinSupplementaryMaterialsC,togenerate
completetrainingsetsfromtheMARtrainingset.Weuseeachimputedtrainingsettobuild
six different predictive models (PLR, RF, NN, GBT, SVM, and C5.0) according to the
methodologyoutlined inSection2.We repeat this experiment100 times for robustness.
Table8summarizestheAUCperformanceoftheclassifiersonthegold-standardtestingsets.
SeeSupplementaryMaterialsEforamoredetaileddescriptionandresults.
Table7.Samplesizeofthegold-standarddataset(derivedfromcompletecasesofP2APP).
Counts Drug-indication
PairsPhase2Trials
UniqueDrugs
UniqueIndications
UniquePhase2Trials
Success 166 341 152 83 337Failure 812 1,672 503 158 1,549Total 978 2,013 623 171 1,872
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page21of44©2019byLo,Siah,andWong,AllRightsReserved
Table8.AUCofdifferent classifiersunderdifferentmissingdataapproaches.Abbreviations:Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile;m:numberofimputationsgenerated.
Gold-StandardTestingSetAUC
ImputationMethod Machine-learningModel Avg Sd 5% 50% 95%
Gold-Standard
PLR
0.810 0.028 0.761 0.808 0.853CompleteCases 0.755 0.040 0.683 0.764 0.813Mean/mode 0.778 0.031 0.729 0.779 0.823Median/mode 0.778 0.031 0.728 0.779 0.8245NN 0.786 0.032 0.738 0.787 0.83410NN 0.787 0.032 0.739 0.791 0.835MI(m=1) 0.781 0.036 0.722 0.777 0.843MI(m=10) 0.782 0.031 0.729 0.782 0.831Gold-Standard
RF
0.837 0.027 0.793 0.837 0.876CompleteCases 0.764 0.048 0.685 0.772 0.830Mean/mode 0.775 0.031 0.726 0.771 0.822Median/mode 0.774 0.031 0.723 0.774 0.8275NN 0.805 0.033 0.755 0.805 0.85710NN 0.802 0.033 0.747 0.805 0.856MI(m=1) 0.797 0.033 0.748 0.795 0.853MI(m=10) 0.804 0.030 0.751 0.804 0.848Gold-Standard
NN
0.800 0.032 0.754 0.799 0.849CompleteCases 0.715 0.043 0.638 0.716 0.779Mean/mode 0.790 0.037 0.739 0.789 0.848Median/mode 0.789 0.036 0.740 0.792 0.8495NN 0.794 0.032 0.743 0.798 0.84210NN 0.797 0.036 0.737 0.798 0.851MI(m=1) 0.780 0.036 0.719 0.781 0.838MI(m=10) 0.795 0.030 0.750 0.795 0.838Gold-Standard
GBT
0.820 0.028 0.776 0.821 0.868CompleteCases 0.746 0.050 0.659 0.756 0.816Mean/mode 0.781 0.034 0.724 0.784 0.826Median/mode 0.778 0.033 0.719 0.783 0.8235NN 0.796 0.029 0.737 0.798 0.83710NN 0.796 0.028 0.748 0.798 0.838MI(m=1) 0.796 0.031 0.747 0.796 0.847MI(m=10) 0.804 0.031 0.757 0.803 0.854Gold-Standard
SVM
0.785 0.030 0.730 0.786 0.831CompleteCases 0.733 0.053 0.650 0.741 0.795Mean/mode 0.766 0.036 0.707 0.771 0.818Median/mode 0.764 0.035 0.711 0.771 0.8185NN 0.771 0.034 0.722 0.770 0.82710NN 0.772 0.037 0.710 0.773 0.825MI(m=1) 0.760 0.035 0.696 0.762 0.813MI(m=10) 0.768 0.030 0.719 0.764 0.813Gold-Standard
C5.0
0.800 0.033 0.758 0.800 0.844CompleteCases 0.710 0.063 0.585 0.713 0.802Mean/mode 0.758 0.039 0.698 0.762 0.816Median/mode 0.754 0.043 0.679 0.751 0.8235NN 0.772 0.038 0.715 0.772 0.84310NN 0.770 0.035 0.710 0.771 0.822MI(m=1) 0.758 0.037 0.701 0.754 0.819MI(m=10) 0.807 0.031 0.756 0.808 0.857
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page22of44©2019byLo,Siah,andWong,AllRightsReserved
Forallsixmachine-learningalgorithms,wefindthatgold-standardclassifiers—thatis,the
modelsderivedfromcompletedata—consistentlyoutperformtheircomplete-caseanalysis
and imputation counterparts.This is logicalbecauseuseful information is invariably lost
whenwe introducemissingness in thedatasets. Incontrast,complete-caseanalysisoften
leadstoinferiorperformance.TheAUCsofclassifierstrainedoncomplete-casestrainingsets
tendtobesmallerthanthosetrainedonimputedtrainingsets.Thissuggeststhatimputation
doesindeedofferimprovedfitandpredictivepoweroverlistwisedeletion.
Overall,wefindkNNimputationtobemostcompatiblewithourdatasets.Itprovidesthe
leastbiasedimputationsamongallmissingdatamethods(seeSupplementaryMaterialsE).
Inparticular,thecombinationofkNNimputation(k=5)withRFgivesoneofthehighest
gold-standard testing set AUCs (0.81). We note a few other MI combinations that yield
comparable or marginally better performance but focus on the 5NN-RF approach in
subsequentanalysesonthemaindatasetsduetoitseaseofimplementationandapplication.
WefindthatSVMhastheworstperformanceamongallmachine-learningmodels.Thisisnot
surprisingbecauseSVMsareaimedonlyatlearningbinaryclassifiers,anddonotgenerally
producegoodclassprobabilityestimates.Consequently,suchmodelsdonotnecessarilygive
highAUCs.
WealsocompareourapproachwiththeANDIalgorithm(DiMasietal.,2015)byapplyinga
modified version of the index on oncology drugs in the gold-standard testing sets (see
SupplementaryMaterialsHforamorein-depthdescription).Wefindthatour5NN-RFmodel
achievessignificantlyhigherAUCthanthemodifiedANDI,withanaverageimprovementof
0.1inAUCover100simulations(seeFig7).Webelievethatthisgaincanbeattributedtoa
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page23of44©2019byLo,Siah,andWong,AllRightsReserved
largertrainingsetwithawiderrangeof features,anonlinearmodel thatcancapturethe
complexrelationshipsinthedata,andapropermodelvalidationmethodology.
Fig7.DistributionsofAUCof5NN-RFandthemodifiedANDIononcology-onlygold-standardtestingsets.
PredictingapprovalsWeanalyzethetwodatasets(P2APPandP3APP)byfirstsplittingeachintoatrainingset
(70%)andatestingset(30%)randomly(pipelinedrug-indicationpairsareomittedsince
theiroutcomeshaveyettobedetermined).Subsequently,wetrain5NN-RFmodelsforeach
scenarioaccordingtothemethodologyoutlinedinSection2.Werepeatthisexperiment100
timesforrobustness.Table9summarizestheAUCperformancemetricsforthetestingsets.
Onaverage,weachieve0.78AUCforP2APPand0.81AUCforP3APP.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page24of44©2019byLo,Siah,andWong,AllRightsReserved
Table 9. Comparison of the general and indication-group specific classifiers for selected indication groups.Abbreviations:Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile.
GeneralClassifier SpecializedClassifiers Avg Sd 5% 50% 95% Avg Sd 5% 50% 95%
P2APPAll 0.777 0.017 0.749 0.775 0.806 - - - - -Anti-cancer 0.805 0.025 0.764 0.805 0.847 0.818 0.029 0.773 0.819 0.865RareDiseases 0.800 0.028 0.756 0.800 0.848 0.775 0.036 0.715 0.777 0.838Neurological 0.767 0.036 0.710 0.769 0.819 0.778 0.039 0.721 0.779 0.834Alimentary 0.749 0.045 0.672 0.751 0.817 0.732 0.048 0.651 0.734 0.807Immunological 0.783 0.065 0.665 0.786 0.889 0.766 0.069 0.646 0.775 0.860Anti-infective 0.735 0.043 0.673 0.736 0.800 0.750 0.047 0.684 0.746 0.832Respiratory 0.756 0.055 0.648 0.764 0.835 0.867 0.043 0.794 0.872 0.921Musculoskeletal 0.822 0.049 0.736 0.821 0.899 0.731 0.076 0.614 0.745 0.849Cardiovascular 0.709 0.072 0.580 0.711 0.812 0.694 0.073 0.579 0.698 0.807Genitourinary 0.633 0.086 0.503 0.634 0.790 0.706 0.091 0.552 0.710 0.840
P3APPAll 0.810 0.018 0.781 0.810 0.834 - - - - -Anti-cancer 0.783 0.047 0.699 0.779 0.853 0.707 0.054 0.612 0.714 0.786RareDiseases 0.819 0.054 0.727 0.822 0.896 0.786 0.058 0.687 0.793 0.875Neurological 0.796 0.037 0.734 0.794 0.857 0.789 0.038 0.741 0.787 0.853Alimentary 0.817 0.047 0.744 0.820 0.891 0.805 0.054 0.718 0.808 0.888Immunological 0.811 0.074 0.680 0.815 0.910 0.757 0.099 0.586 0.765 0.892Anti-infective 0.757 0.065 0.644 0.752 0.854 0.708 0.068 0.600 0.707 0.808Respiratory 0.823 0.065 0.712 0.831 0.920 0.773 0.083 0.627 0.784 0.907Musculoskeletal 0.741 0.095 0.576 0.747 0.866 0.763 0.072 0.646 0.762 0.882Cardiovascular 0.794 0.058 0.702 0.788 0.887 0.755 0.076 0.639 0.765 0.864Genitourinary 0.814 0.083 0.670 0.821 0.937 0.801 0.090 0.635 0.808 0.927
The observed performance is essentially the MAR testing set AUC, since backfilling has
alreadyaffectedthedatasetsused.InSupplementaryMaterialsE,wehighlighttheperilsof
relyingontheMARtestingsetformodelvalidation,andsuggestthattheAUCsforthegold-
standard and MCAR testing sets are more reflective of a classifier’s real performance.
Unfortunately, we have access to neither the gold-standard nor the MCAR testing sets,
becausewedonotknowthetrue,underlyingvaluesofthemissingfeatures.However,our
experiments indicate that the AUCs for the MAR and MCAR testing sets of the 5NN-RF
combinationareveryclose(adifferenceof0.002onaverage).Thismeansthatwemayuse
theformer,theonlyobservedfigure,asareasonableestimateofthelatter,whichreflects
realperformance.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page25of44©2019byLo,Siah,andWong,AllRightsReserved
Next,wetrainclassifiersbasedontheunionofthetrainingandtestingsets,andusethemto
generatepredictionsforpipelinedrug-indicationpairs.WegeneratepredictionsforP2APP
usingonlyinformationfromphase2trialsandforP3APPusingonlyinformationfromphase
3trials.WhilewecannotcomputeAUCscoresforthesesamplesbecausetheiroutcomesare
stillpending,wecancomparetheirpredictionscoreswiththeirdevelopmentstatusatthe
timeof thiswriting.Thesepipelinedrug-indicationpairsmaystillbe in thesameclinical
stage(nochange,i.e.,phase2forP2APP;phase3forP3APP),beterminated(failed),orhave
progressedtohigherphases(advanced).
Fig8,Table10,andTable11summarizethedistributionsofpipelinepredictionscores.We
findthatpairsthatfailgenerallyhavelowerscoresthanthosethatadvancetolaterphases
ofdevelopment.InFig8,weobservepeaksatthelowerendofthescorespectrumforfailed
pairs(red) forbothdatasets. Incontrast,pairs thatadvancetendtohavepeaksathigher
scores(green).Weobservethesamepatternswhenwedisaggregatethedistributionsby
indicationgroups:thegreenpartstendtoclusterabovethedistributionmedianwhilethe
redpartsclusterbelow.However,therearealsosomeindicationgroupsforwhichthereare
toofewsamplestomakeanyusefulremarks(e.g.,hormonalproductsinP2APP).FromTable
10,weseethattheaveragescoresoffailedpairsareindeedlowerthanthosethatadvance
(differencesrangingfrom0.05to0.15).InTable11,webindrug-indicationpairsthathave
newdevelopments (whether failure or advancement) into four groupings, dependingon
theirpredictionscores.Foreachbin,wecomputetheproportionofsamplesthatadvanceto
laterdevelopmentstages.We find that theproportionsgenerally increasewith thescore
magnitude,suggestingthatpairswithhigherscoresaremorelikelytoadvancethanthose
with lower scores.Wenote thatprogress to later clinical stagesdoesnotalways lead to
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page26of44©2019byLo,Siah,andWong,AllRightsReserved
approval. However, the results are still promising because advancement is a necessary
condition for approval. Our experiments indicate that our trained classifiers are able to
discriminatebetweenhigh-andlow-potentialcandidates.
Table10.Distributionsofpredictionscoresforallindicationgroupsinaggregate(seeFig8).Advancedreferstoprogresstoahigherphasefromtheoriginalphase.OriginalphaseforP2APPisphase2;forP3APPisphase3.Forinstance,outof1,511drug-indicationpairsintheP2APPtestingset,859pairsarestillpendingdecisioninphase2,244pairshavefailedand408pairshavesuccessfullyadvancedtophase3testing.Abbreviations:n:samplesize;Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile.
PredictionScores n Avg Sd 5% 50% 95%
P2APPAggregate 1,511 0.153 0.061 0.044 0.155 0.258Nochange 859 0.143 0.060 0.041 0.147 0.246Failed 244 0.137 0.061 0.034 0.147 0.240Advanced 408 0.183 0.056 0.093 0.178 0.274
P3APPAggregate 252 0.417 0.189 0.128 0.402 0.695Nochange 142 0.392 0.185 0.129 0.384 0.693Failed 32 0.348 0.185 0.100 0.344 0.656Advanced 78 0.492 0.176 0.233 0.492 0.699
Table11.Distributionsofpredictionscoresforallindicationgroupsinaggregate(seeFig8).Proportionreferstothefractionofsamplesthatadvancedtoalaterphasefromtheoriginalphase.Abbreviations:n:samplesize.
Scores n ProportionP2APP
<0.1 108 0.2310.1-0.2 368 0.6710.2-0.3 171 0.766≥0.3 5 1.000
P3APP<0.2 13 0.3080.2-0.4 35 0.6860.4-0.6 27 0.667≥0.6 35 0.914
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page27of44©2019byLo,Siah,andWong,AllRightsReserved
Fig8.DistributionsofpredictionscoresforP2APPandP3APP.Firstrowforallindicationgroupsinaggregate.Subsequentrowsforspecificindicationgroups.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page28of44©2019byLo,Siah,andWong,AllRightsReserved
To gain insight into the logic of our trained predictivemodels,we compute the average
importanceoffeaturesusedinthe5NN-RFclassifiersoveralltheexperiments,andextract
thetoptenmostinformativevariables.TheRFclassifier(Pedregosaetal.,2011)weused
computestheimportanceofavariablebyfindingthedecreaseinnodeimpurityforallnodes
thatsplitonthatvariable,weightedbytheprobabilityofreachingthatnode(asestimated
by the proportion of samples reaching that node), averaged over all trees in the forest
ensemble(Breiman,Friedman,Stone,&Olshen,1984).Table12summarizestheresults.
Table12.Toptenimportantvariablesof5NN-RFclassifiersforP2APPandP3APP.Averageandstandarddeviationtakenacrossallexperiments.Abbreviations:Avg:average;Sd:standarddeviation.
Importance Avg Sd
P2APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.234 0.043Trialstatus 0.160 0.026Medium–solution 0.051 0.018Actualaccrual 0.046 0.010Sponsortype–industry,allotherpharma 0.025 0.008Sponsortrackrecord–numberofpositivephase3trials 0.023 0.006Sponsortrackrecord–numberoffaileddrug-indicationpairs 0.021 0.007Studydesign–placebocontrol 0.019 0.009Targetaccrual 0.018 0.005Priorapprovalofdrugforanotherindication 0.018 0.007
P3APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.357 0.028Trialstatus 0.148 0.014Duration 0.099 0.016Trialoutcome–terminated,lackofefficacy 0.033 0.010Trialoutcome–completed,negativeoutcome,orprimaryendpoint(s)notmet 0.033 0.008Therapeuticarea–oncology 0.030 0.009Priorapprovalofdrugforanotherindication 0.021 0.007Actualaccrual 0.015 0.003Medium–powder 0.014 0.007Medium–solution 0.012 0.006
Wefindthattrialoutcome(whetherthetrialwascompletedwithitsprimaryendpointsmet)
and trial status (whether the trial was completed or terminated) have significant
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page29of44©2019byLo,Siah,andWong,AllRightsReserved
associationswithsuccess.Thesetwofeatureswereconsistentlyrankedthetoptwooutof
allvariablesandacrossbothdatasets.Itiseasytoimaginethatadrug-indicationpairwhose
trialswereterminatedhasalowprobabilityofsuccessintermsofadvancingfromphases
2/3 to approval. In contrast, candidates that achieve positive outcomes certainly have a
bettershotatsuccess.Wealsoobservethatpriorapprovalofadrughasaneffectonsuccess
fornewindicationsorpatientsegmentation.Itisplausiblethatdevelopinganapproveddrug
foranewindicationhasagreaterlikelihoodofsuccessthananewcandidate.
Inaddition,trialcharacteristicssuchasaccrual,duration,andthenumberofidentifiedsites
frequentlyappearinthetoptenimportantvariables.Thereareseveralpossibleexplanations.
Forexample,trialsthatendquicklywithoutachievingprimaryendpointsmayundermine
the likelihood of success, and drugs with trials that have small accrual—and thus low
statisticalpower—mayhavealowerprobabilityofbeingapproved.
Wealsofindsponsortrackrecords—quantifiedbythenumberofpastsuccessfultrials(trials
thatachievepositiveresultsormeetprimaryendpoints)—tobeausefulfactorforprediction.
This factor has not been considered in previous related studies, but the intuition for its
predictivepowerisclear:strongtrackrecordsarelikelyassociatedwithgreaterexpertisein
drugdevelopment.
Sincedrugsdevelopedfordifferentindicationgroupsmayhaveverydifferentcharacteristics,
wemightexpectclassifierstrainedonindication-group-specificdatatooutperformgeneral
classifiers.We build and analyze such specialized classifiers by filtering the datasets by
indicationgroupbeforeperformingtheexperimentdescribedintheprevioussection.Asa
comparison,wealsobreakdown theperformanceof thegeneral classifiersby indication
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page30of44©2019byLo,Siah,andWong,AllRightsReserved
group. Table 9 shows the results for selected indication groups. In general, we find
specializedmodelstogivepoorerperformancethangeneralmodels.Thisislikelybecause
theformeraretrainedonlessdata,whichmakesthemlessaccurateandmoresusceptibleto
overfitting.
We note that the approach adopted in this section—splitting drug-indication pairs into
trainingandtestingsetsrandomlywithoutconsideringthedatesofdevelopment—maybe
less than idealbecauseof look-aheadbias.Forexample, if the resultsof a2008 trial are
included inthetrainingset forpredictingtheoutcomeofa2004developmentpath fora
drug-indicationpair,ourmodelwillbeusingfutureinformationduringvalidation,whichcan
yieldmisleadingand impractical inferences.Toaddressthis issue, in thenextsectionwe
apply our machine-learning framework to time-series data using rolling windows that
accountfortemporalorderingintheconstructionoftrainingandtestingsets.Althoughthis
processmakesuseoflessdatawithineachestimationwindowthanwhentheentiredataset
isused,itminimizestheimpactoflook-aheadbiasandyieldsmorerealisticinferences.We
studytheeffectsofrandomsplittingversustemporalorderinginSupplementaryMaterials
J.
PredictionsovertimeDrugdevelopmenthaschangedsubstantiallyovertime,thankstonewscientificdiscoveries
and technological improvements.To reflect these changes inourpredictiveanalytics,we
adoptatimeseries,walk-forwardapproachtocreatetrainingandtestingsetsforeachofthe
twodatasets,P2APPandP3APP(seeFig9).Wesamplefive-yearrollingwindowsbetween
2004and2014fromeachdataset.Eachwindowconsistsofatrainingsetofdrug-indication
pairswhoseoutcomesbecomefinalizedwithinthewindow,andanout-of-sample,out-of-
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page31of44©2019byLo,Siah,andWong,AllRightsReserved
timetestingsetofdrug-indicationpairsthatendedphase2orphase3testing,butarestillin
thepipelinewithundeterminedoutcomeswithin thewindow.Forexample, consider the
P2APPdataset.Wedrawthe firstwindowfrom2004–2008, trainouralgorithmondrug-
indicationpairsthatfailedorwereapprovedwithinthisperiodasthetrainingset,andapply
thetrainedmodeltopredicttheoutcomesofdrug-indicationsthatjustendedphase2testing
withinthesamewindowasthetestingset.
Fig9.Time-serieswalk-forwardanalysisapproach.Thetestingset inthelastwindow(green)comprisesdrug-indicationpairsinthepipelineatthetimeofsnapshotofthedatabases.
We evaluate the resulting classifier by comparing its predictionswith outcomes that are
realized in the future (2009–2015). This rolling-window approach yields a total of eight
overlapping trainingand testingperiodswhereanew5NN-RFmodel is trained foreach
period.Theeighthtestingperiodconsistsofdrug-indicationpairsinthepipelineatthetime
ofsnapshotofthedatabases.Unlikethefirstsevenperiods,theiroutcomesarestillpending
current development, and thereforewe cannot compute a testing AUC for this window.
However,wecanexaminethepredictionsandcomparethescoreswiththeirdevelopment
statusesatthetimeofthiswriting.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page32of44©2019byLo,Siah,andWong,AllRightsReserved
Fig10summarizestheresultsofthetime-seriesanalysisforthefirstsevenwindows.We
observeanincreasingtrendovertheyearsforbothP2APP(0.67inthefirstand0.80inthe
lastwindow)andP3APP(0.77inthefirstand0.88inthelastwindow).Interestingly,wenote
thattheproportionsofcompletecasesinthetrainingsetscorrelatewellwiththetimeseries
AUC(correlationcoefficient0.95forP2APPand0.90forP3APP).Wecomputetheproportion
ofcompletecasesbytakingthenumberoffeaturevectorrowswithcompleteinformation
over the total number of rows. As is apparent from Fig 10, the proportions have been
increasing over the years for both datasets. This is likely due to better data reporting
practicesbydrugdevelopers,apossibleconsequenceofFDAAA.
Fig10.Time-serieswalk-forwardanalysisforP2APPandP3APPusing5NN-RF.Weusebootstrappingtodeterminethe95%CIforAUC(dottedlines).Thedashedlinesplotthecorrespondingproportionsofcompletecasesinthetrainingsetsofeachfive-yearwindow.Abbreviations:CC:proportionofcompletecases.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page33of44©2019byLo,Siah,andWong,AllRightsReserved
Next,weexaminethe2011–2015window.Fig11,Table13,andTable14summarizethe
distributions of prediction scores for the P2APP and P3APP datasets. We observe very
similarpatternstothestaticpipelinepredictionsabove.Thehistograms,averagescores,and
binningofsamples indicatethatpairs that fail tendtohave lowerpredictionscoresthan
thosethatadvance.Thisshowsthatourclassifiersareindeedabletodifferentiatesuccessful
candidates.
Table13.Distributionsofpredictionscoresforallindicationgroupsinaggregate(seeFig11).Advancedreferstoprogresstoahigherphasefromtheoriginalphase.OriginalphaseforP2APPisphase2; forP3APPisphase3.Abbreviations:n:samplesize;Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile.
PredictionScores n Avg Sd 5% 50% 95%
P2APPAggregate 1,190 0.158 0.080 0.036 0.173 0.290Nochange 712 0.148 0.080 0.035 0.158 0.275Failed 195 0.143 0.079 0.034 0.149 0.255Advanced 283 0.197 0.071 0.068 0.200 0.323
P3APPAggregate 218 0.431 0.211 0.113 0.476 0.689Nochange 121 0.395 0.207 0.113 0.403 0.684Failed 28 0.362 0.211 0.093 0.335 0.640Advanced 69 0.521 0.193 0.149 0.631 0.707
Table14.Distributionofpredictionscoresforallindicationgroupsinaggregate(seeFig11).Proportionreferstothefractionofsamplesthatadvancedtoahigherphasefromtheoriginalphase.Abbreviations:n:samplesize.
Scores n ProportionP2APP
<0.1 99 0.3130.1-0.2 183 0.6070.2-0.3 168 0.690≥0.3 28 0.893
P3APP<0.2 17 0.4120.2-0.4 17 0.7060.4-0.6 17 0.647≥0.6 46 0.848
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page34of44©2019byLo,Siah,andWong,AllRightsReserved
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page35of44©2019byLo,Siah,andWong,AllRightsReserved
Fig11.Distributionsofpredictionscoresofthe2011–2015windowtestingsetforP2APPandP3APP.Firstrowforallindicationgroupsinaggregate.Subsequentrowsforspecificindicationgroups.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page36of44©2019byLo,Siah,andWong,AllRightsReserved
Table15summarizesthetoptenmostinformativevariablesinthe5NN-RFclassifiersover
theeightrollingwindows.Wefindthemtobelargelyconsistentwiththoseobservedinthe
staticcase:thetrialoutcomeandtrialstatusaresignificantlyassociatedwithsuccess;trial
characteristics (such as accrual, duration, and number of identified sites), sponsor track
record,anddrugmediumappearfrequentlyinbothscenarios.
Table15.Toptenimportantvariablesin5NN-RFclassifiersforP2APPandP3APP.Averageandstandarddeviationtakenacrosstheeightrollingwindows.Abbreviations:Avg:average;Sd:standarddeviation.
Importance Avg Sd
P2APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.203 0.083Trialstatus 0.102 0.033Priorapprovalofdrugforanotherindication 0.077 0.061Actualaccrual 0.039 0.015Targetaccrual 0.031 0.010Duration 0.027 0.014Sponsortrackrecord–numberofcompletedphase3trials 0.025 0.007Medium–suspension 0.024 0.018Sponsortype–academic 0.023 0.017Medium–solution 0.021 0.019
P3APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.348 0.028Trialstatus 0.125 0.020Duration 0.053 0.017Priorapprovalofdrugforanotherindication 0.046 0.028Trialoutcome–completed,negativeoutcome,orprimaryendpoint(s)notmet 0.033 0.026Targetaccrual 0.021 0.005Trialoutcome–terminated,lackofefficacy 0.020 0.013Actualaccrual 0.019 0.004Therapeuticarea–oncology 0.017 0.013Numberofidentifiedsites 0.012 0.002
Asinthestaticcase,wealsotrainindication-groupspecificclassifiersusingrollingwindows.
Table16andTable17summarizetheresultsforselectedindicationgroupsinP2APPand
P3APP, respectively (see Supplementary Materials G for results of all other indication
groups). Indication groups with small sample sizes tend to produce poor and unstable
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page37of44©2019byLo,Siah,andWong,AllRightsReserved
specializedclassifiers(e.g.,themusculoskeletalindicationgroupinP2APP).Thisisexpected
becausemodelstrainedonsmalltrainingsetsaremoresusceptibletooverfitting,especially
whennon-linearalgorithmssuchasRFareused.Incontrast,indicationgroupswithlarger
samplesizestendtogiverisetorathergoodclassifiers(e.g.,anti-cancerinP2APP).
Table16.Comparisonof thegeneraland indication-groupspecific classifiers for selected indicationgroups inP2APP.Weusebootstrappingtodeterminethe95%CIforAUC.
GeneralClassifier SpecializedClassifiers TrainSet TestSet AUC[95%CI] TrainSet TestSet AUC[95%CI]
All2004–2008 1,361 551 0.669[0.614,0.725] - - -2005–2009 1,562 591 0.680[0.625,0.735] - - -2006–2010 1,764 636 0.712[0.668,0.755] - - -2007–2011 1,969 598 0.738[0.698,0.777] - - -2008–2012 2,082 597 0.799[0.760,0.837] - - -2009–2013 2,212 517 0.823[0.779,0.867] - - -2010–2014 2,289 380 0.797[0.718,0.876] - - -
Anti-cancer2004–2008 1,361 137 0.665[0.528,0.803] 456 137 0.683[0.533,0.833]2005–2009 1,562 163 0.739[0.618,0.861] 494 163 0.635[0.512,0.758]2006–2010 1,764 188 0.774[0.702,0.846] 546 188 0.726[0.635,0.816]2007–2011 1,969 193 0.830[0.773,0.887] 618 193 0.746[0.661,0.831]2008–2012 2,082 198 0.805[0.717,0.894] 682 198 0.760[0.665,0.855]2009–2013 2,212 177 0.852[0.783,0.922] 736 177 0.786[0.696,0.876]2010–2014 2,289 173 0.815[0.691,0.938] 791 173 0.803[0.666,0.940]
Musculoskeletal2004–2008 1,361 35 0.765[0.597,0.933] 96 35 0.704[0.512,0.896]2005–2009 1,562 38 0.716[0.489,0.944] 109 38 0.674[0.472,0.876]2006–2010 1,764 35 0.634[0.439,0.830] 111 35 0.509[0.276,0.742]2007–2011 1,969 37 0.737[0.571,0.903] 119 37 0.677[0.493,0.860]2008–2012 2,082 36 0.884[0.773,0.995] 127 36 0.683[0.462,0.904]2009–2013 2,212 26 0.792[0.573,1.000] 133 26 0.667[0.429,0.904]2010–2014 2,289 19 0.882[0.724,1.000] 128 19 0.882[0.706,1.000]
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page38of44©2019byLo,Siah,andWong,AllRightsReserved
Table17.Comparisonof thegeneraland indication-groupspecific classifiers for selected indicationgroups inP3APP.Weusebootstrappingtodeterminethe95%CIforAUC.
GeneralClassifier SpecializedClassifiers TrainSet TestSet AUC[95%CI] TrainSet TestSet AUC[95%CI]
All2004–2008 472 196 0.769[0.704,0.834] - - -2005–2009 559 177 0.724[0.650,0.798] - - -2006–2010 604 211 0.738[0.671,0.805] - - -2007–2011 664 174 0.806[0.740,0.871] - - -2008–2012 677 197 0.827[0.768,0.886] - - -2009–2013 740 153 0.868[0.809,0.927] - - -2010–2014 734 110 0.876[0.811,0.941] - - -
Anti-cancer2004–2008 472 34 0.773[0.618,0.928] 95 34 0.684[0.495,0.874]2005–2009 559 28 0.740[0.543,0.936] 107 28 0.568[0.345,0.791]2006–2010 604 50 0.754[0.599,0.910] 110 50 0.630[0.452,0.809]2007–2011 664 24 0.587[0.333,0.842] 132 24 0.392[0.132,0.651]2008–2012 677 40 0.793[0.549,1.000] 134 40 0.668[0.457,0.879]2009–2013 740 29 0.800[0.480,1.000] 151 29 0.775[0.528,1.000]2010–2014 734 26 0.943[0.842,1.000] 153 26 0.852[0.558,1.000]
RareDiseases2004–2008 472 22 0.711[0.465,0.957] 54 22 0.620[0.364,0.876]2005–2009 559 23 0.735[0.517,0.952] 60 23 0.606[0.360,0.852]2006–2010 604 24 0.888[0.747,1.000] 66 24 0.825[0.645,1.000]2007–2011 664 22 0.838[0.652,1.000] 72 22 0.735[0.520,0.950]2008–2012 677 34 0.893[0.780,1.000] 76 34 0.700[0.523,0.877]2009–2013 740 28 0.962[0.899,1.000] 94 28 0.932[0.840,1.000]2010–2014 734 18 0.908[0.766,1.000] 109 18 0.985[0.942,1.000]
For comparison, we disaggregate performance by indication group. We find that these
classifiersdonotloseouttotheirspecializedcounterparts.Infact,ourresultsshowthatthe
formertendtoexhibitmorestableperformanceacrossthesevenwindows,particularlyon
indicationgroupswithsmallsamplesizes.Wehypothesizethatclassifierstrainedonalldata
benefit fromhavingaccess to largerdatasetswithgreaterdiversity, andare thusable to
makemoreinformedpredictions.Thissuggeststhatitmaybemoreappropriatetorelyon
generalclassifiers,ratherthanspecializedones,forpredictionsovertimewheresamplesare
spreadoutovermultiplewindows,sincefurtherfilteringbyindicationgroupresultsineven
smallersamplesizes.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page39of44©2019byLo,Siah,andWong,AllRightsReserved
Finally,weextractthetopfiveP2APPpipelinedrugcandidateswiththehighestscoresin
each indication group as predicted by the 2011–2015 rolling-window model. Table 18
summarizestheresults.Weincludeonlycandidatesthatarestilloutstandingatthetimeof
writing(neitherdiscontinuednorapproved).Itisencouragingthatmanyofthesecandidates
(indicatedinitalics)haveadvancedbeyondphase2testingsinceouranalysis,indicatingthe
predictivepowerofourmodels.WeincludeaninteractiveversioninFig12wherereaders
canfilterourpipelinepredictionsbyindicationgroupandprobabilityofapproval.Ultimately,
allbiopharmastakeholderscanusesuchscorestorankandevaluatethepotentialrisksand
rewardsofdrugcandidates.
Fig12.NetworkgraphofP2APPpipelinedrugcandidates.Blacknodescorrespondtoindicationgroups.Colorednodescorrespondtodrug-indicationpairs.Eachdrug-indicationpairnodeisconnectedtoitsparentindicationgroupandalsootherdrug-indicationpairs thathave thesame indication.Theyarecoloredaccordingto theirrespectiveprobabilityofapprovalaspredictedbyourmodel—blueforhigherscoresandredforlowerscores.Hoverovernodesfordetailsofeachdrug-indicationpair.Blackindicationgroupnodesaresizedbasedonthenumberofconnections.
Theinformationpresentedthroughthisresearchandintheincludedfigurearemadeavailablesolelyforgeneralinformationalpurposes.Theauthorsdonotwarranttheaccuracy,completenessorusefulnessofthisinformation.Anyrelianceyouplaceonsuchinformationisstrictlyatyourownrisk.Theauthorsexpresslydisclaimallliabilityandresponsibilityarisingfromanyrelianceplacedonsuchinformationbyyou,orbyanyonewhomaybeinformedofthisinformation.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page40of44©2019byLo,Siah,andWong,AllRightsReserved
Table18.TopfiveP2APPpipelinedrugcandidateswiththehighestscoresineachindicationgroupaspredictedbyourmodel.Weincludeonlycandidatesthatarestilloutstandingatthetimeofwriting(neitherdiscontinuednorapproved).Drug-indicationpairs in italicsare those thathaveadvancedbeyondphase2 testingsinceouranalysis.
Drug Indication Score Drug Indication ScoreAnti-cancer Musculoskeletalontecizumab Cancer,colorectal 0.34 tofacitinib Arthritis,psoriatic 0.31
calmangafodipirRadio/chemotherapy-inducedinjury,bonemarrow,neutropenia
0.31 ixekizumab Arthritis,rheumatoid 0.31
tivantinib Cancer,sarcoma,softtissue 0.30 anti-BLyS/APRILantibodyfusionprotein Arthritis,rheumatoid 0.31
pidilizumab Cancer,colorectal 0.29 sirukumab Arthritis,rheumatoid 0.29NK-012 Cancer,colorectal 0.28 romosozumab Osteoporosis 0.28RareDiseases Cardiovascularsurotomycin Infection,Clostridiumdifficile 0.34 K-134 Peripheralvasculardisease 0.37tivantinib Cancer,sarcoma,softtissue 0.30 nitricoxide,inhaled Hypertension,pulmonary 0.29
VP-20621 Infection,Clostridiumdifficileprophylaxis 0.30 TY-51924 Infarction,myocardial 0.28
KHK-7580 Secondaryhyperparathyroidism 0.29 s-amlodipine+telmisartan Hypertension,unspecified 0.27nitricoxide,inhaled Hypertension,pulmonary 0.29 tirasemtiv Peripheralvasculardisease 0.24Neurological Dermatological
dasotraline Attentiondeficithyperactivitydisorder 0.35 tofacitinib Arthritis,psoriatic 0.31
idalopirdine Alzheimer'sdisease 0.35 dimethylfumarate Psoriasis 0.27GRC-17536 Neuropathy,diabetic 0.34 pefcalcitol Psoriasis 0.24caprylictriglyceride Alzheimer'sdisease 0.32 Benvitimod Psoriasis 0.22
levodopa Parkinson'sdisease 0.31 calcipotriolmonohydrate+betamethasonedipropionate
Psoriasis 0.22
Alimentary Genitourinary
ibodutant Irritablebowelsyndrome,diarrhoea-predominant 0.37
etonogestrel+estradiol(vaginalring),nextgeneration
Contraceptive,female 0.30
GRC-17536 Neuropathy,diabetic 0.34 drospirenone+estradiol Contraceptive,female 0.28mesalazine+N-acetylcysteine Colitis,ulcerative 0.31 finerenone Nephropathy,diabetic 0.27
apabetalone(tablet) Diabetes,Type2 0.31 afacifenacinfumarate Overactivebladder 0.26phosphatidylcholine Colitis,ulcerative 0.31 GKT-137831 Nephropathy,diabetic 0.26Immunological BloodandClotting
tofacitinib Arthritis,psoriatic 0.31 calmangafodipirRadio/chemotherapy-inducedinjury,bonemarrow,neutropenia
0.31
ixekizumab Arthritis,rheumatoid 0.31 balugrastim Radio/chemotherapy-inducedinjury,bonemarrow,neutropenia 0.27
anti-BLyS/APRILantibodyfusionprotein Arthritis,rheumatoid 0.31 eflapegrastim Radio/chemotherapy-induced
injury,bonemarrow,neutropenia 0.25
sirukumab Arthritis,rheumatoid 0.29 pegfilgrastim Radio/chemotherapy-inducedinjury,bonemarrow,neutropenia 0.22
dimethylfumarate Psoriasis 0.27 lexaptepidpegol Radio/chemotherapy-inducedanaemia 0.20
Anti-infective Sensory
delafloxacin Infection,skinandskinstructure,acutebacterial 0.39 AR-13324+latanoprost Glaucoma 0.27
surotomycin Infection,Clostridiumdifficile 0.34 S-646240 Maculardegeneration,age-related,wet 0.27
delafloxacin Infection,pneumonia,community-acquired 0.33 netarsudil Glaucoma 0.26
plazomicin Infection,urinarytract,complicated 0.33 fenofibrate,micronized-2 Oedema,macular,diabetic 0.25
Ypeginterferonalpha-2b Infection,hepatitis-Cvirus 0.33 LX-7101 Glaucoma 0.21Respiratory Hormonalfluticasone+salmeterol Asthma 0.36 KHK-7580 Secondaryhyperparathyroidism 0.29fluticasonefuroate+umeclidinium+vilanterol
Chronicobstructivepulmonarydisease 0.36 somatropinprodrug,
pegylated Growthhormonedeficiency 0.21
fluticasonefuroate+umeclidinium
Chronicobstructivepulmonarydisease 0.36 2MD Secondaryhyperparathyroidism 0.21
beclometasone+formoterol
Chronicobstructivepulmonarydisease 0.35 velcalcetide Secondaryhyperparathyroidism 0.19
fluticasonepropionateDPI Asthma 0.35 tesamorelinacetate Growthhormonedeficiency 0.18
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page41of44©2019byLo,Siah,andWong,AllRightsReserved
4 DiscussionDrugdevelopmentisanextremelycostlyprocessandtheaccurateevaluationofacandidate
drug’s likelihood of approval is critical to the efficient allocation of capital. Historical
successes and failures contain valuable insights on the characteristics of high-potential
candidates. Unfortunately, such data are often incomplete due to partial reporting by
investigatorsanddevelopers.Mostanalyticmethodsrequirecompletedata,however,and
priorstudiesonestimatingapprovalratesandpredictingapprovalsaretypicallybasedona
smallnumberofexamplesthathavecompleteinformationforjustafewfeatures.
In thispaper,we extract twodatasets, P2APP andP3APP, from Informa®databases and
apply5NNstatisticalimputationtomakeefficientuseofallavailabledata.Weusemachine-
learningtechniquestotrainandvalidateourRFpredictivemodelsandachievepromising
levelsofpredictivepowerforbothdatasets.Whenappliedtopipelinedrugs,wefindthat
candidateswithhigherscoresareindeedmorelikelytoadvancetohigherclinicalphases,
indicating that our 5NN-RF classifiers are able to discriminate between high- and low-
potentialcandidates.
Atime-seriesanalysisofthedatasetsshowsgenerallyincreasingtrendsinperformanceover
five-year rolling windows from 2004 to 2014.We find that the classifiers’ performance
correlateswellwiththeproportionsofcompletecasesinthetrainingsets:ascompleteness
increases, the classifier learns better and achieves higher AUCs. This highlights the
importance of data quality in building more accurate predictive algorithms for drug
development.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page42of44©2019byLo,Siah,andWong,AllRightsReserved
Finally,wecomputefeatureimportanceinthepredictivemodelsandfindthattrialoutcomes,
trialstatus,trialaccrualrates,duration,priorapprovalforanotherindication,andsponsor
track records are the most critical features for predicting success. Because the 5NN-RF
classifiersarenon-linear,thereisnosimpleinterpretationoftheincrementalcontribution
ofeachpredictortotheforecast.However,theintuitionbehindsomeofthesefactorsisclear:
drug-indication pairs with trials that achieve positive outcomes certainly have a better
chance of approval; candidates sponsored by companies with strong track records and
greater expertise in drug development should have higher likelihood of success; and
approveddrugsmayhavehigherchancesofapprovalforasecondrelatedindication.Many
ofthesefactorscontainusefulsignalsaboutdrugdevelopmentoutcomesbuthavenotbeen
consideredinpriorstudies.
These results are promising and raise the possibility of even more powerful drug
developmentpredictionmodelswithaccess tobetterqualitydata.This canbedrivenby
programssuchasProjectDataSphere(Greenetal.,2015)andVivli(Bierer,Li,Barnes,&Sim,
2016)thatpromoteandfacilitatepublicsharingofpatient-levelclinicaltrialdata.Ultimately,
suchpredictiveanalyticscanbeusedtomakemoreinformeddata-drivendecisionsinthe
riskassessmentandportfoliomanagementofinvestigationaldrugsatallclinicalstages.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page43of44©2019byLo,Siah,andWong,AllRightsReserved
ReferencesBierer,B.E.,Li,R.,Barnes,M.,&Sim,I.(2016).Aglobal,neutralplatformforsharingtrialdata.New
EnglandJournalofMedicine,374(25),2411-2413.
Breiman,L.,Friedman,J.,Stone,C.J.,&Olshen,R.A.(1984).Classificationandregressiontrees.CRCPress.
Collins,L.M.,Schafer,J.L.,&Kam,C.M.(2001).Acomparisonofinclusiveandrestrictivestrategiesinmodernmissingdataprocedures.PsychologicalMethods,6(4),330.
DiMasi,J.A.,Hermann,J.C.,Twyman,K.,Kondru,R.K.,Stergiopoulos,S.,Getz,K.A.,&Rackoff,W.(2015). A tool for predicting regulatory approval after phase II testing of new oncologycompounds.ClinicalPharmacology&Therapeutics,98(5),506-513.
El-Maraghi, R.H., & Eisenhauer, E. A. (2008).Review of phase II trial designs used in studies ofmolecular targetedagents:outcomesandpredictorsofsuccess inphase III. JournalofClinicalOncology,26(8),1346-1354.
Enders,C.K.(2010).Appliedmissingdataanalysis.GuilfordPress.
Fawcett,T.(2006).AnintroductiontoROCanalysis.PatternRecognitionletters,27(8),861-874.
Fernandez, J. M., Stein, R. M., & Lo, A.W. (2012). Commercializing biomedical research throughsecuritizationtechniques.NatureBiotechnology,30(10),964-975.
Goffin,J.,Baral,S.,Tu,D.,Nomikos,D.,&Seymour,L.(2005).Objectiveresponsesinpatientswithmalignant melanoma or renal cell cancer in early clinical studies do not predict regulatoryapproval.ClinicalCancerResearch,11(16),5928-5934.
Green,A.K.,Reeder-Hayes,K.E.,Corty,R.W.,Basch,E.,Milowsky,M.I.,Dusetzina,S.B.,Bennett,A.V.,&Wood,W.A.(2015).Theprojectdatasphereinitiative:acceleratingcancerresearchbysharingdata.TheOncologist,20(5),464-e20.
Informa-PharmaceuticalClinicalTrialIntelligenceProducts.(2016).Informa.Retrieved5December2016, from https://pharmaintelligence.informa.com/products-and-services/data-and-analysis/citeline-joins-informas-pharma-intelligence
Jardim,D.L.,Groves,E.S.,Breitfeld,P.P.,&Kurzrock,R.(2017).Factorsassociatedwithfailureofoncology drugs in late-stage clinical development: A systematic review. Cancer TreatmentReviews,52,12-21.
Kuhn,M.,Weston,S.,Coulter,N.,&Quinlan,R.(2014).C50:C5.0decisiontreesandrule-basedmodels.Rpackageversion0.1.0-21.
Malik,L.,Mejia,A.,Parsons,H.,Ehler,B.,Mahalingam,D.,Brenner,A.,SarantopoulosJ.,&Weitman,S.(2014).PredictingsuccessinregulatoryapprovalfromPhaseIresults.CancerChemotherapyandPharmacology,74(5),1099-1103.
Pedregosa,F.,Varoquaux,G.,Gramfort,A.,Michel,V.,Thirion,B.,Grisel,O.,BlondelM.,PrettenhoferP.,WeissR.,DubourgV.,&Vanderplas,J.(2011).Scikit-learn:MachinelearninginPython.JournalofMachineLearningResearch,12(Oct),2825-2830.
Rubin,D.B.(1976).Inferenceandmissingdata.Biometrika,63(3),581-592.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American StatisticalAssociation,91(434),473-489.
HDSR
Issue1
19May2019 MachineLearningPredictionsofDrugApprovals Page44of44©2019byLo,Siah,andWong,AllRightsReserved
Scannell, J., Blanckley, A., Boldon, H. & Warrington, B. (2012). Diagnosing the decline inpharmaceuticalR&Defficiency.NatureReviewsDrugDiscovery11,191–200.
Schafer,J.L.(1997).Analysisofincompletemultivariatedata.CRCPress.
Schafer, J.L.,&Graham, J.W. (2002).Missingdata:ourviewof thestateof theart.PsychologicalMethods,7(2),147-177.
VanBuuren,S.(2012).Flexibleimputationofmissingdata.CRCPress.
Zarin,D.A.,Tse,T.,Williams,R.J.,&Carr,S.(2016).Trialreportinginclinicaltrials.gov—thefinalrule.NewEnglandJournalofMedicine,375(20),1998-2004.