HDSR, Issue 1

Machine learning with statistical imputation for predicting drug approvals

Andrew W. Lo 1,2,3,*, Kien Wei Siah 1,2, Chi Heem Wong 1,2

1 Laboratory for Financial Engineering, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
2 EECS and CSAIL, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
3 Santa Fe Institute, Santa Fe, New Mexico, United States of America
* Corresponding author email: [email protected] (A.W.L.)

Abstract

We apply machine-learning techniques to predict drug approvals using drug-development and clinical-trial data from 2003 to 2015 involving several thousand drug-indication pairs with over 140 features across 15 disease groups. To deal with missing data, we use imputation methods that allow us to fully exploit the entire dataset, the largest of its kind. We show that our approach outperforms complete-case analysis, which typically yields biased inferences. We achieve predictive measures of 0.78 and 0.81 AUC ("area under the receiver operating characteristic curve," the estimated probability that a classifier will rank a positive outcome higher than a negative outcome) for predicting transitions from phase 2 to approval and phase 3 to approval, respectively. Using five-year rolling windows, we document an increasing trend in the predictive power of these models, a consequence of improving data quality and quantity. The most important features for predicting success are trial outcomes, trial status, trial accrual rates, duration, prior approval for another indication, and sponsor track records. We provide estimates of the probability of success for all drugs in the current pipeline.

Keywords: Drug development; Clinical trials; Healthcare; Machine learning; Imputation

Acknowledgments: We thank Informa for providing us access to their data and expertise and are particularly grateful to Christine Blazynski, Mark Gordon, and Michael Hay for many helpful comments and discussion throughout this project. We also thank them and Anna Barker, Linda Blackerby, Lara Boro, Steve Finch, Howard Fingert, Xiao-Li Meng, Ellen Moore, Jim Reddoch, Lara Sullivan, Marty Tenenbaum, James Wade, and three anonymous referees for specific comments on this manuscript, and Jayna Cummings and Paige Sammartino for editorial assistance. Research support from the MIT Laboratory for Financial Engineering is gratefully acknowledged. The views and opinions expressed in this article are those of the authors only, and do not necessarily represent the views and opinions of any institution or agency, any of their affiliates or employees, or any of the individuals acknowledged above.


Author contributions: Conceptualization, data-license agreement, A.W.L.; Methodology, A.W.L., K.W.S., and C.H.W.; Investigation, K.W.S.; Writing (first draft), K.W.S.; Writing (review, editing, final draft), K.W.S. and C.H.W.; Visualization, K.W.S.; Project supervision, A.W.L.


1 Introduction

While many recent medical breakthroughs such as immuno-therapies, gene therapies, and gene-editing techniques offer new hope for patients, they have also made biomedical innovation riskier, and more complex and expensive. These breakthroughs generate novel therapies for investigation, each of which requires many years of translational research and clinical testing, costing hundreds of millions to billions of dollars and yet often facing a high likelihood of failure (Fernandez, Stein, & Lo, 2012). In fact, drug development productivity—the ratio of the number of new drugs approved to R&D spending each year—has declined steadily over the past 50 years despite scientific and technical progress. This phenomenon, which Scannell, Blanckley, Boldon, and Warrington (2012) termed "Eroom's Law," as the reverse of Moore's Law, suggests that the cost of developing new drugs has doubled approximately every nine years since the 1950s. In the face of multiple uncertainties, the need to evaluate drug candidates better and allocate capital to high-potential opportunities more efficiently has only intensified.

To address these needs, in this article we apply machine-learning techniques to predict the outcomes of randomized clinical trials. Machine learning is an interdisciplinary field focused on tackling pattern-recognition problems and building predictive models to make data-driven decisions, which is well suited for this context. Successful applications of these techniques have already revolutionized a number of industries (e.g., advertising, marketing, finance and insurance, oil and gas exploration) and are poised for even greater impact via autonomous vehicles, facial-recognition authentication, and general-purpose robotics.


Drug developers have already applied machine-learning tools to the discovery process via high-throughput screening of vast libraries of chemical and biological compounds to identify drug targets. However, in managing their portfolios of investigational drugs, biopharma companies typically use unconditional estimates of regulatory approval rates based on historically observed relative frequencies. Machine-learning techniques yield conditional estimates of success, conditioned on a host of predictive factors known to affect the likelihood of approval, including drug compound characteristics, clinical trial design, previous trial outcomes, and the sponsor track record. We show that these features contain useful signals about drug development outcomes that will allow us to forecast the outcome of pipeline developments more accurately.

Our methodology and results have several implications for stakeholders in the biomedical ecosystem. More accurate forecasts of the likelihood of success of clinical trials will reduce the uncertainty surrounding drug development, which will increase the amount of capital that investors and drug developers are willing to allocate to this endeavor. By extension, this would lower the cost of capital and increase the efficiency of the allocation process. Specifically, we predict the probability of success of drug candidates in two scenarios: (1) advancing from phase 2 to regulatory approval, and (2) from phase 3 to regulatory approval (see Fig 1). Investors and drug developers may use such predictions to evaluate the risks of different investigational drugs at different clinical stages, providing them with much-needed transparency. Greater risk transparency is one source of improved financial efficiency because it facilitates more accurate matching of investor risk preferences with the risks of biomedical investment opportunities.


Machine-learning models can also offer guidance to scientists, clinicians, and biopharma professionals as to which factors are most important in determining clinical-trial success, suggesting ways to improve the drug development process and decelerate or reverse Eroom's Law.

Policymakers and regulators would also benefit from machine-learning predictions, particularly for drug-indication pairs that are predicted to fail with high likelihood—these cases highlight the most difficult challenges in biomedicine and underscore the need for greater government and philanthropic support.

Fig 1. Predictive models for assessing the probability of approval of drug candidates in two scenarios: (1) after phase 2 testing, and (2) after phase 3 testing.

To the best of our knowledge, our study is the largest of its kind. We construct two datasets, one for each scenario, from two proprietary pharmaceutical pipeline databases, Pharmaprojects and Trialtrove, provided by Informa® (Informa, 2016). The phase-2-to-approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval dataset contains more than 1,400 unique drugs for 253 indications and over 4,500 phase 3 trials. These data cover over 15 indication groups. In contrast, most published research on drug approval prediction has very small sample sizes, is concentrated on specific therapeutic areas, and involves only one or a small number of predictive factors: Malik et al. (2014) examined the trial objective responses of 88 anticancer agents in phase 1; Goffin, Baral, Tu, Nomikos, and Seymour (2005) studied the tumor response rates of 58 cytotoxic agents in 100 phase 1 trials and 46 agents in 499 phase 2 trials; El-Maraghi and Eisenhauer (2008) looked at the objective responses of 19 phase 2 anticancer drugs in 89 single-agent trials; Jardim, Groves, Breitfeld, and Kurzrock (2017) examined the response rates of 80 phase 3 oncology drugs to identify factors associated with failures; and DiMasi et al. (2015) analyzed 62 cancer drugs and proposed an approved new drug index (ANDI) algorithm with four factors to predict approval for lead indications in oncology after phase 2 testing (see Supplementary Materials H for a comparison of our analysis to theirs).

Another key difference in our approach is that we deal with missing data using statistical imputation methods. We explore four common approaches to "missingness" and demonstrate their advantages and disadvantages over discarding incomplete cases. With the FDA Amendments Act of 2007, drug and clinical trial data collection has been rapidly expanding, but these data are often sparse, and our dataset is no exception. Related studies (e.g., DiMasi et al., 2015) have typically used only complete-case observations—discarding clinical trials with any missing information—which typically eliminates large portions of the data and may also lead to certain biases.

We use machine-learning techniques to form our predictions, including cross-validation for training and a held-out testing set for performance evaluation, and use the standard "area under the receiver operating characteristic curve" (AUC) metric to measure model performance (AUC is the estimated probability that a classifier will rank a positive outcome higher than a negative outcome [Fawcett, 2006]). We achieve AUCs of 0.78 for predicting phase 2 to approval (95% confidence interval (CI): [0.75, 0.81]) and 0.81 for predicting phase 3 to approval (95% CI: [0.78, 0.83]). A time-series, walk-forward analysis approach shows similar results. We also apply our models to the current drug pipeline—that is, all drugs still in development as of the end of our dataset—to identify the candidates that have the highest and lowest probabilities of success. We examine the latest development statuses of these pipeline drug-indication pairs—a true "out-of-sample" experiment (validation on data not used in model building)—and find that candidates with higher scores are, indeed, more likely to progress to later clinical stages. This indicates that our classifiers do discriminate between high- and low-potential candidates.
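For readers who want a concrete sense of this metric, the following sketch (illustrative only, not the authors' code) computes an AUC with scikit-learn on made-up scores and verifies the pairwise-ranking interpretation given above:

```python
# Illustrative sketch: AUC from predicted approval scores, and its equivalence
# to the probability that a randomly chosen approved pair outranks a randomly
# chosen failed pair. All values are invented for demonstration.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])                    # 1 = approved, 0 = failed
y_score = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.6, 0.8, 0.3])   # model scores

auc = roc_auc_score(y_true, y_score)

# Pairwise interpretation: fraction of (positive, negative) pairs in which the
# positive outcome receives the higher score (ties count as 0.5).
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
assert np.isclose(auc, pairwise)
```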

2 Materials and methods

Data

The commercial data vendor Informa® offers two databases that are used in our analysis: Pharmaprojects, which specializes in drug information, and Trialtrove, which specializes in clinical trials information (Informa, 2016). These two databases aggregate drug and trial information from over 30,000 data sources in more than 150 countries, including company press releases, government drug databases (e.g., Drugs@FDA) and trial databases (e.g., Clinicaltrials.gov [Zarin, Tse, Williams, & Carr, 2016], Clinicaltrialsregister.eu [extracted from EudraCT]), and scientific conferences and publications. Using these sources, we construct two datasets of drug-indication pairs: phase 2 to approval (P2APP) and phase 3 to approval (P3APP). We extract clinical trial features from Trialtrove, and augment this data using drug features from Pharmaprojects. Applying machine-learning algorithms to these datasets allows us to estimate: (1) whether a drug-indication pair that has concluded phase 2 testing will be approved eventually; and (2) whether a pair that has concluded phase 3 testing will be approved eventually. Data cleaning procedures are outlined in Supplementary Materials A.

We consider all indications associated with a particular drug, as opposed to only the lead indication. We extract all features that could conceivably correlate with the likelihood of success, from drug compound attributes (31 features from Pharmaprojects profiles) to clinical trial characteristics (113 features from Trialtrove). These features are defined in Table 1 and Supplementary Materials A. In general, each dataset may be partitioned into two disjoint subsets: one with samples that have known outcomes, and another with samples that are still in the pipeline at the time of the snapshot of the databases (that is, the outcomes are unknown). To provide intuition for the characteristics of the samples, we describe key summary statistics of each subset.

The P2APP dataset consists of 6,344 drug-indication pairs that have ended phase 2 testing; that is, there are no phase 2 trials in progress or planned in the database. The phase 2 trials in this dataset range from August 8, 1990 to December 15, 2015. In our sample, 4,812 pairs have known outcomes, while 1,532 pairs are still in the pipeline. In the subset with known outcomes, we define the development statuses of suspension, termination, and lack of development as "failures" (86.8%), and registration and launch as "successes" or approvals (13.2%). The P3APP dataset consists of 1,870 pairs that have ended phase 3 testing, of which 1,610 pairs have known outcomes, while 260 pairs are still in the pipeline. For those pairs with known outcomes, we define "failures" (59.1%) and "successes" (40.9%) in the same fashion as the P2APP dataset. The phase 3 trials in P3APP span from January 1, 1988 to November 1, 2015. These figures are summarized in Table 2. Here, the use of the terms "success" and "failure" is in the context of achieving approval. We note that our definition of "failures" can include drug development programs that are terminated due to factors unrelated to the performance of the drug (e.g., market conditions, business decisions). In Section 3 (Results), we find that this outcome variable has significant associations with trial performance and other factors.

The datasets cover 15 indication groups: alimentary, anti-infective, anti-parasitic, blood and clotting, cardiovascular, dermatological, genitourinary, hormonal, immunological, musculoskeletal, neurological, anti-cancer, rare diseases, respiratory, and sensory products. Anti-cancer agents make up the largest subgroup in P2APP, and the second largest in P3APP (see Table 3). Industry-sponsored trials dominate both datasets (see Table 4). In aggregate, we observe a decreasing trend in success rates over five-year rolling windows from 2003 to 2015 (see Fig 2).

To the best of our knowledge, this sample is the largest of its kind. All prior published research in this literature involved fewer than 100 drugs or 500 trials (DiMasi et al., 2015; El-Maraghi & Eisenhauer, 2008; Goffin et al., 2005; Malik et al., 2014). In addition, our datasets cover a diverse set of indication groups, as opposed to a single area such as oncology.


Table 1. Description of parent features extracted from Pharmaprojects and Trialtrove. Some parent features are multi-label (e.g., a trial may be tagged with United States and United Kingdom simultaneously). We transform all multi-label parent features into binary child features (1 or 0). See Supplementary Materials A for specific examples of each feature. Note that drug-indication pairs for the same drug have the same drug features; drug-indication pairs involved in the same trial have the same trial features.

Drug features
  Route: Route of administration of the drug, the path by which the drug is taken into the body. (Multi-label)
  Origin: Origin of the active ingredient in the drug. (Multi-label)
  Medium: Medium of the drug. (Multi-label)
  Biological target family: Family of proteins in the body whose activity is modified by the drug, resulting in a specific effect. (Multi-label)
  Pharmacological target family: Mechanism of action of the drug, the biochemical interaction through which the drug produces its pharmacological effect. (Multi-label)
  Drug-indication development status: Current phase of development of the drug for the indication. (Binary)
  Prior approval of drug for another indication: Approval of the drug for another indication prior to the indication under consideration (specific to drug-indication pair). (Binary)

Trial features
  Duration: Duration of the trial (from reported start date to end date) in days. (Continuous)
  Study design: Design of the trial (keywords). (Multi-label)
  Sponsor type: Sponsors of the trial grouped by types. (Multi-label)
  Therapeutic area: Therapeutic areas targeted by the trial. (Multi-label)
  Trial status: Status of the trial. (Binary)
  Trial outcome: Results of the trial. (Multi-label)
  Target accrual: Target patient accrual of the trial. (Continuous)
  Actual accrual: Actual patient accrual of the trial. (Continuous)
  Locations: Locations of the trial by country. (Multi-label)
  Number of identified sites: Number of sites where the trial was conducted. (Continuous)
  Biomarker involvement: Type of biomarker involvement in the trial. (Multi-label)
  Sponsor track record: Sponsor's success in developing other drugs prior to the drug-indication pair under consideration. (Continuous)
  Investigator experience: Primary investigator's success in developing other drugs prior to the drug-indication pair under consideration. (Continuous)
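As a rough illustration of the multi-label-to-binary transformation described in Table 1, the sketch below uses scikit-learn's MultiLabelBinarizer on hypothetical trial locations; the values are invented, and the authors' actual encoding pipeline is not shown here.

```python
# Minimal sketch of turning a multi-label parent feature into 0/1 child features.
from sklearn.preprocessing import MultiLabelBinarizer

# e.g., trial locations for three hypothetical trials
locations = [
    {"United States", "United Kingdom"},
    {"United States"},
    {"Germany", "United States"},
]

mlb = MultiLabelBinarizer()
X_loc = mlb.fit_transform(locations)   # one binary child feature per country
print(mlb.classes_)                    # ['Germany' 'United Kingdom' 'United States']
print(X_loc)
```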


Table 2. Sample sizes of the P2APP and P3APP datasets. We consider phase 2 trial information in the P2APP dataset and phase 3 trial information in the P3APP dataset.

                   Drug-indication   Phase 2/3   Unique   Unique        Unique
Counts             pairs             trials      drugs    indications   phase 2/3 trials
P2APP  Success        635             2,563        540       173          2,486
       Failure      4,177            10,328      2,779       263          9,722
       Pipeline     1,532             2,815      1,189       221          2,713
       Total        6,344            15,706      4,073       288         14,584
P3APP  Success        659             1,830        572       171          1,801
       Failure        951             2,425        764       203          2,360
       Pipeline       260               494        240       120            480
       Total        1,870             4,749      1,451       253          4,552

Table 3. Breakdown of drug-indication pairs by indication groups. A drug-indication pair may have multiple indication-group tags. For instance, renal cancer is tagged as both anti-cancer and rare disease in Pharmaprojects.

Count                P2APP   P3APP
All                  6,344   1,870
Anti-cancer          2,239     409
Rare diseases        1,105     259
Neurological         1,069     444
Alimentary             757     249
Immunological          474     101
Anti-infective         493     177
Respiratory            428     134
Musculoskeletal        394     121
Cardiovascular         388     158
Dermatological         254      45
Genitourinary          210      85
Blood and clotting     160      97
Sensory                137      41
Hormonal                17       4
Anti-parasitic           8       0

Table 4. Breakdown of trials by sponsor types. A trial may be sponsored by more than one party (e.g., a collaboration between industry developers and academia).

Counts                             P2APP   P3APP
All                               14,584   4,552
Other pharma                       5,432   1,721
Top 20 pharma                      5,322   2,369
Academic                           4,869     736
Government                         1,807     314
Cooperative group                    958     230
Not for profit                       181      51
Generic                               52      54
Contract research organization        41      17


Fig 2. Success rates in P2APP and P3APP over five-year rolling windows from 2003 to 2015.

Missing data

Prior to the 2007 FDA Amendments Act (FDAAA), it was not uncommon for investigators to release only partial information about pipeline drugs and clinical trials, to protect trade secrets or simply because there was no incentive to do more. Even today, some investigators still do not adhere to the FDAAA-mandated registration policy or submit adequate registrations. Therefore, all historical drug development databases have missing data. We note that the "missingness" here is largely related to the post-study reporting of clinical trial data as opposed to in-trial data missingness (e.g., censorship of panel data due to patients terminating trial participation prematurely). In the former case, the data (e.g., trial duration, trial outcomes) are usually available to the investigators but may not be released publicly, and are thus considered "missing" from our standpoint. Therefore, our dataset may be considered an approximation of the actual data available to the FDA and individual drug developers.

Fig 3, Fig 4, Table 5, and Table 6 summarize the patterns of missingness in our dataset (we exclude pipeline drug-indication pairs here because their outcomes are still pending). The missing data patterns are multivariate. When conditioned on the latest level of development for any indication, we find that successful drugs generally have lower levels of missingness compared to failed drugs. For instance, in the P2APP dataset, 61% of failed drugs have an unknown medium, while only 15% of approved drugs are missing this feature. We also observe that completed trials tend to have greater levels of missingness than terminated trials. Between the two datasets, we find that the P3APP dataset, which focuses on phase 3 drugs and trials, generally has less missing data for both drug and trial features than the P2APP dataset, which focuses on phase 2 drugs and trials. This is expected, since phase 3 trials are primarily used to support registration filings.


Fig 3. Missingness patterns of drug features. Each row corresponds to a unique drug. Features not included in the figure are complete and do not have missing values. Abbreviations: Dev status: highest level of development of a drug for any indication; Pharma: pharmacological target family; Bio: biological target family.

Fig 4. Missingness patterns of trial features. Each row corresponds to a unique clinical trial. Features not included in the figure are complete and do not have missing values. Abbreviations: Dev status: highest level of development of a drug for any indication; Status: trial status; Idsi: number of identified sites; Actacc: actual accrual; Dura: duration; Taracc: target accrual; Loc: locations; Dkw: trial study design keywords; Outcome: trial outcomes.


Table 5. Missingness in drug features with respect to unique drugs (see Fig 3). The column heading "Unconditional" refers to overall missingness without conditioning on outcome.

Missingness                        Unconditional   Success   Failure
P2APP
  Route                                 0.04         0.00      0.04
  Pharmacological target family         0.06         0.02      0.07
  Biological target family              0.32         0.27      0.32
  Medium                                0.53         0.15      0.61
P3APP
  Route                                 0.01         0.00      0.02
  Pharmacological target family         0.03         0.02      0.04
  Biological target family              0.27         0.24      0.30
  Medium                                0.35         0.14      0.54

Table 6. Missingness in trial features with respect to unique trials (see Fig 4). The column heading "Unconditional" refers to overall missingness without conditioning on trial status.

Missingness                    Unconditional   Completion   Termination
P2APP
  Number of identified sites        0.10           0.10          0.10
  Actual accrual                    0.12           0.10          0.22
  Duration                          0.26           0.29          0.05
  Target accrual                    0.37           0.42          0.09
  Locations                         0.02           0.02          0.02
  Study design keywords             0.22           0.24          0.10
  Trial outcomes                    0.63           0.73          0.11
P3APP
  Number of identified sites        0.10           0.09          0.12
  Actual accrual                    0.12           0.09          0.26
  Duration                          0.17           0.19          0.06
  Target accrual                    0.27           0.31          0.09
  Locations                         0.01           0.01          0.02
  Study design keywords             0.09           0.09          0.06
  Trial outcomes                    0.53           0.62          0.07

Most related studies do not report the extent of missing data in their samples, presumably because smaller datasets were used. DiMasi et al. (2015) reported missing data for some of their factors, and addressed it through listwise deletion—deleting all observations with any missing factors. Since statistical estimators often require complete data, this approach is the simplest remedy for missingness. However, it greatly reduces the amount of data available and decreases the statistical power of the resulting estimates. Furthermore, listwise deletion is valid only under strict and unrealistic assumptions (see below), and when such conditions are violated, inferences are biased. In the current study, we make an effort to include in our analysis all observed examples, with or without complete features, through the use of statistical imputation.

Missing data may be classified into three categories (Rubin, 1976): missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR refers to data that are missing for reasons entirely independent of the data; MAR applies when the missingness can be fully accounted for by the observed variables; and MNAR refers to situations where neither MCAR nor MAR is appropriate, in which case the probability of missingness depends on the value of an unobserved variable (Van Buuren, 2012). See Supplementary Materials B for the precise definitions of each type of missingness.
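As a shorthand for these distinctions (the paper's formal definitions are in Supplementary Materials B), the three categories can be written in the standard notation of Rubin (1976), with $R$ the missingness indicator and $Y_{\mathrm{obs}}, Y_{\mathrm{mis}}$ the observed and missing parts of the data:

$$
\begin{aligned}
\text{MCAR:}\quad & \Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = \Pr(R), \\
\text{MAR:}\quad  & \Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = \Pr(R \mid Y_{\mathrm{obs}}), \\
\text{MNAR:}\quad & \Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \text{ depends on } Y_{\mathrm{mis}}.
\end{aligned}
$$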

If the missingness is MCAR, the observed samples can be viewed as a random subsample of the dataset. Consequently, using listwise deletion should not introduce any bias. While convenient, this assumption is rarely satisfied in practice. In most drug-development databases, failed drugs are more likely to have missing features than successful drugs (see Table 5). Clearly, MCAR does not hold.

Applying listwise deletion when the missingness is not MCAR can lead to severely biased estimates. Moreover, given the nature of drug-development reporting, a large portion of the original data may be discarded if many variables have missing values. For these reasons, the listwise-deletion approach adopted by DiMasi et al. (2015) and others is less than ideal.

Given only the observed data, it is impossible to test for MAR versus MNAR (Enders, 2010). However, our knowledge of the data-collection process suggests that MAR is a plausible starting point, and we hypothesize that the missingness in drug and trial features is mainly accounted for by drug development and trial statuses, respectively. Our observations in Table 5 and Table 6 support this approach, as the missingness proportions for some features differ greatly depending on the outcome.

Our assumption of MAR is consistent with the data-collection methodology in the Informa® databases. Drug profiles are built up over time in Pharmaprojects. As a drug advances to later phases, information about its characteristics becomes more readily available because investigators release more data about pipeline drugs after each phase of clinical testing. Informa® inputs this information into its databases as it becomes available in the public domain or through primary research. Approved drugs are more likely to have more complete profiles, while information about failed drugs tends to stay stagnant because no further studies are conducted. It is very plausible that the MAR nature of our datasets is an artifact of data collection, and by extension, so are similar pharmaceutical datasets extracted from the public domain and maintained in the same fashion. Originally intended to track drug and trial activities, Pharmaprojects and Trialtrove are not structured to keep track of information updates over time, since there was no use for it. Without timestamps of the updates, we are not able to eliminate the MAR artifact from our datasets.

In our analysis, we impute the missing data under the more plausible MAR assumption to obtain complete datasets. In contrast to listwise deletion, we fill in missing values using information in the observed variables. This allows us to utilize data that would otherwise be discarded. Thereafter, we can apply all the usual statistical estimators to this imputation-completed data.


Methods

Our analysis consists of two parts, which we perform in R version 3.2.3. First, we impute missing values to generate complete datasets. Next, we apply a range of machine-learning algorithms to build predictive models based on the imputed data. An illustration of the specific components of our analysis appears in Fig 5.

Fig 5. Modeling methodology adopted in this study. Abbreviations: CV: cross-validation.

We formulate our two scenarios as supervised bipartite ranking problems, where the goal is to predict the outcome—success or failure—of a drug-indication pair given a set of input features. Initially, we split each dataset into training and testing sets. For each scenario, we train various classifiers based on the corresponding training set, and compute the expected error of our predictive models by testing them on the held-out testing set.

We create feature matrices from the datasets by representing the drug and trial features of each drug-indication pair as vectors (see Fig 6). Drug-indication pairs associated with multiple trials are represented by the same number of feature vectors, e.g., a pair with two trials has two rows. We give a concrete example in Fig 6. Consider the drug-indication pair Analiptin-diabetes type 2 in the P2APP dataset. We represent it using two vector rows, since it has two phase 2 trials in Trialtrove. Note that the feature matrix is incomplete due to missing drug and trial features. We also construct a column vector of labels, which contains the outcomes of the drug-indication pairs. Labels are not available for pipeline drug-indication pairs because they are still in development and their outcomes are still uncertain; hence these observations are not used to train our classifiers. However, with the trained classifiers, we can generate predictions for pipeline data.

We split each dataset (excluding pipeline drug-indication pairs) into two disjoint sets, one training set and one testing set, and form feature matrices for both according to the drug-indication pairs in each set. The testing sets serve as out-of-sample datasets to evaluate our models. Therefore, we mask their outcomes (that is, we treat them as unknown) and access them only at the very end to check our performance.

Fig 6. Feature matrix of a dataset. Each row corresponds to a feature vector; each feature corresponds to an entry in the vector; each vector has a length of 144, since we have 31 drug and 113 trial features. The feature vectors of all drug-indication pairs in the dataset collectively form the feature matrix. Trial ID is a unique trial identifier in Trialtrove.
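A minimal sketch of this row-per-trial representation, with hypothetical feature names and values rather than actual Pharmaprojects/Trialtrove fields, might look like the following:

```python
# Hypothetical sketch of the feature-matrix layout described above: one row per
# (drug-indication pair, trial), plus one label per pair. Values are invented.
import numpy as np
import pandas as pd

rows = pd.DataFrame([
    # Analiptin-diabetes type 2 has two phase 2 trials, hence two rows that
    # share the same drug features but differ in trial features.
    {"pair": "Analiptin-diabetes type 2", "trial_id": "T1",
     "duration": 320.0, "actual_accrual": 150.0, "route_oral": 1},
    {"pair": "Analiptin-diabetes type 2", "trial_id": "T2",
     "duration": np.nan, "actual_accrual": 210.0, "route_oral": 1},
])

# one label per drug-indication pair (1 = approved, 0 = failed)
labels = pd.Series({"Analiptin-diabetes type 2": 0})
```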


To deal with missing data in both the training and testing sets, we consider listwise deletion and four statistical imputation techniques commonly used in social science research and biostatistics: unconditional mean imputation, k-nearest-neighbor (kNN) imputation, multiple imputation (MI), and decision-tree algorithms (see Supplementary Materials C for details). We follow best practices of the missing-data literature by including as many relevant auxiliary variables as possible, as well as all variables used in subsequent models (Collins, Schafer, & Kam, 2001; Enders, 2010; Rubin, 1996; Schafer & Graham, 2002). This makes the assumption of MAR more plausible in our datasets, and helps to reduce bias in subsequent analyses (Schafer, 1997). In particular, it is necessary to include our target variable—the drug-indication development status—in our imputation model, because we hypothesize that missingness is mainly accounted for by it. This is not an issue for the training sets. However, the outcomes in the testing sets are masked, and not supposed to be known. Therefore, we treat the testing-set outcomes as though they were missing and impute them together with all the other missing features. After imputation, we discard the imputed testing-set outcomes, and use only the imputed feature values for predictions. We do the same when evaluating pipeline datasets.
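The paper performs imputation in R; purely as an illustration of the masking-and-imputing procedure just described, the following sketch uses scikit-learn's KNNImputer on made-up arrays:

```python
# Minimal sketch of kNN imputation with the outcome included in the imputation
# model; testing-set outcomes are treated as missing and later discarded.
import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([[320., 150., np.nan],
                    [np.nan, 210., 80.],
                    [400., np.nan, 95.]])
y_train = np.array([1., 0., 0.])                      # known training outcomes

X_test = np.array([[np.nan, 180., 90.]])              # outcomes masked

imputer = KNNImputer(n_neighbors=2)
# append the outcome as an extra column; for the test set it is set to NaN
train_aug = np.column_stack([X_train, y_train])
test_aug = np.column_stack([X_test, np.full(len(X_test), np.nan)])

imputer.fit(train_aug)
X_train_imp = imputer.transform(train_aug)[:, :-1]    # keep imputed features only
X_test_imp = imputer.transform(test_aug)[:, :-1]      # discard the imputed outcome
```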

With respect to the machine-learning algorithm, we explore several linear and non-linear classifiers commonly used in this literature: penalized logistic regression (PLR), random forests (RF), neural networks (NN), gradient boosting trees (GBT), support vector machines with radial basis functions (SVM), and C5.0 decision trees. We implement the first five algorithms using the scikit-learn package in Python (Pedregosa et al., 2011) and the sixth using the C50 package in R (Kuhn, Weston, Coulter, & Quinlan, 2014). For training, we weight each feature-matrix row according to the number of trials of the corresponding drug-indication pair. In our earlier example, the drug-indication pair Analiptin-diabetes type 2 was involved in two phase 2 trials. It is represented by two vector rows in the feature matrix (see Fig 6). Both rows are used as training examples, and each is weighted equally during training (0.5, since there are two trials in total). To obtain predictions for a drug-indication pair, we average the output probabilities and scores of the corresponding feature vector rows that are used as inputs to the classifier.
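A hypothetical sketch of this weighting and per-pair averaging, using a random forest as the example classifier and invented data, is shown below:

```python
# Sketch of per-row sample weights (1 / number of trials for the pair) and
# per-pair averaging of predicted probabilities; column names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.DataFrame({
    "pair": ["A-ind1", "A-ind1", "B-ind2"],
    "x1": [0.2, 0.4, 0.9], "x2": [1.0, 0.8, 0.1],
    "label": [1, 1, 0],
})
weights = 1.0 / train.groupby("pair")["pair"].transform("size")

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train[["x1", "x2"]], train["label"], sample_weight=weights)

test = pd.DataFrame({"pair": ["C-ind3", "C-ind3"],
                     "x1": [0.3, 0.5], "x2": [0.7, 0.6]})
row_scores = clf.predict_proba(test[["x1", "x2"]])[:, 1]
# average row-level scores to obtain one score per drug-indication pair
pair_scores = pd.Series(row_scores, index=test["pair"]).groupby(level=0).mean()
```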

All machine-learning algorithms have hyper-parameters that affect the flexibility of the model and must be tuned to each dataset to optimize goodness of fit. Poorly chosen hyper-parameters can lead to overfitting (attributing signal to noise) or underfitting (attributing noise to signal). We tune our parameters using k-fold cross-validation (with k = 5 or 10, depending on the sample size). Since the cross-validation process should emulate the testing process as closely as possible, we include imputation in the cross-validation loop as well. We split the training set into validation and non-validation folds. Then we treat the validation-fold outcomes as missing, and impute them as we would for a testing set. From here, we ignore the imputed validation-fold outcomes and proceed with the standard validation process.

In the final step, we test the trained classifiers on the unseen testing sets for out-of-sample model validation. This gives the expected performance of our predictive models for each of the scenarios, using the standard AUC metric to measure model performance.
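The following is a minimal sketch, under the assumption that the feature matrix X and label vector y are already encoded numerically, of cross-validation with imputation inside the loop and validation-fold outcomes masked; it is not the authors' implementation:

```python
# Cross-validation with imputation in the loop: the outcome column is masked
# for validation rows before imputing, and the imputed outcomes are discarded.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, n_splits=5, n_neighbors=5):
    aucs = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True,
                                    random_state=0).split(X):
        # outcome known for training rows, masked (NaN) for validation rows
        outcome = np.where(np.isin(np.arange(len(y)), train_idx), y, np.nan)
        aug = np.column_stack([X, outcome])
        imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(aug)
        X_imp = imputed[:, :-1]                    # drop the imputed outcomes

        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X_imp[train_idx], y[train_idx])
        scores = clf.predict_proba(X_imp[val_idx])[:, 1]
        aucs.append(roc_auc_score(y[val_idx], scores))
    return np.mean(aucs)
```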


3 Results

Simulation of listwise deletion versus imputation

We study the effects of imputation using a "gold-standard" dataset derived from the complete cases of the P2APP dataset (see Table 7). To simulate the missingness present in the original dataset, we introduce missingness in the gold-standard dataset based on our MAR assumption and the missingness patterns observed in the P2APP dataset. We randomly split the drug-indication pairs into a training set (70%) and a testing set (30%), and use five different missing-data approaches, as described in Supplementary Materials C, to generate complete training sets from the MAR training set. We use each imputed training set to build six different predictive models (PLR, RF, NN, GBT, SVM, and C5.0) according to the methodology outlined in Section 2. We repeat this experiment 100 times for robustness. Table 8 summarizes the AUC performance of the classifiers on the gold-standard testing sets. See Supplementary Materials E for a more detailed description and results.

Table 7. Sample size of the gold-standard dataset (derived from complete cases of P2APP).

           Drug-indication   Phase 2   Unique   Unique        Unique
Counts     pairs             trials    drugs    indications   phase 2 trials
Success        166              341      152         83            337
Failure        812            1,672      503        158          1,549
Total          978            2,013      623        171          1,872
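As a rough sketch of the idea of introducing outcome-dependent (MAR) missingness into a complete matrix, the following uses arbitrary masking rates rather than the missingness patterns actually estimated from P2APP:

```python
# Illustrative MAR "amputation": the probability that a feature is masked
# depends on the (observed) outcome, mimicking the direction seen in Table 5.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X_gold = rng.normal(size=(n, p))        # complete ("gold-standard") features
y = rng.integers(0, 2, size=n)          # 1 = approved, 0 = failed

# failed drugs get a higher masking probability (rates are arbitrary here)
p_miss = np.where(y == 1, 0.15, 0.60)
mask = rng.uniform(size=(n, p)) < p_miss[:, None]

X_mar = X_gold.copy()
X_mar[mask] = np.nan
```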


Table 8. AUC of different classifiers under different missing-data approaches. Abbreviations: Avg: average; Sd: standard deviation; 5%: 5th percentile; 50%: median; 95%: 95th percentile; m: number of imputations generated.

                                     Gold-standard testing set AUC
Model   Imputation method       Avg     Sd      5%      50%     95%
PLR     Gold-standard           0.810   0.028   0.761   0.808   0.853
        Complete cases          0.755   0.040   0.683   0.764   0.813
        Mean/mode               0.778   0.031   0.729   0.779   0.823
        Median/mode             0.778   0.031   0.728   0.779   0.824
        5NN                     0.786   0.032   0.738   0.787   0.834
        10NN                    0.787   0.032   0.739   0.791   0.835
        MI (m=1)                0.781   0.036   0.722   0.777   0.843
        MI (m=10)               0.782   0.031   0.729   0.782   0.831
RF      Gold-standard           0.837   0.027   0.793   0.837   0.876
        Complete cases          0.764   0.048   0.685   0.772   0.830
        Mean/mode               0.775   0.031   0.726   0.771   0.822
        Median/mode             0.774   0.031   0.723   0.774   0.827
        5NN                     0.805   0.033   0.755   0.805   0.857
        10NN                    0.802   0.033   0.747   0.805   0.856
        MI (m=1)                0.797   0.033   0.748   0.795   0.853
        MI (m=10)               0.804   0.030   0.751   0.804   0.848
NN      Gold-standard           0.800   0.032   0.754   0.799   0.849
        Complete cases          0.715   0.043   0.638   0.716   0.779
        Mean/mode               0.790   0.037   0.739   0.789   0.848
        Median/mode             0.789   0.036   0.740   0.792   0.849
        5NN                     0.794   0.032   0.743   0.798   0.842
        10NN                    0.797   0.036   0.737   0.798   0.851
        MI (m=1)                0.780   0.036   0.719   0.781   0.838
        MI (m=10)               0.795   0.030   0.750   0.795   0.838
GBT     Gold-standard           0.820   0.028   0.776   0.821   0.868
        Complete cases          0.746   0.050   0.659   0.756   0.816
        Mean/mode               0.781   0.034   0.724   0.784   0.826
        Median/mode             0.778   0.033   0.719   0.783   0.823
        5NN                     0.796   0.029   0.737   0.798   0.837
        10NN                    0.796   0.028   0.748   0.798   0.838
        MI (m=1)                0.796   0.031   0.747   0.796   0.847
        MI (m=10)               0.804   0.031   0.757   0.803   0.854
SVM     Gold-standard           0.785   0.030   0.730   0.786   0.831
        Complete cases          0.733   0.053   0.650   0.741   0.795
        Mean/mode               0.766   0.036   0.707   0.771   0.818
        Median/mode             0.764   0.035   0.711   0.771   0.818
        5NN                     0.771   0.034   0.722   0.770   0.827
        10NN                    0.772   0.037   0.710   0.773   0.825
        MI (m=1)                0.760   0.035   0.696   0.762   0.813
        MI (m=10)               0.768   0.030   0.719   0.764   0.813
C5.0    Gold-standard           0.800   0.033   0.758   0.800   0.844
        Complete cases          0.710   0.063   0.585   0.713   0.802
        Mean/mode               0.758   0.039   0.698   0.762   0.816
        Median/mode             0.754   0.043   0.679   0.751   0.823
        5NN                     0.772   0.038   0.715   0.772   0.843
        10NN                    0.770   0.035   0.710   0.771   0.822
        MI (m=1)                0.758   0.037   0.701   0.754   0.819
        MI (m=10)               0.807   0.031   0.756   0.808   0.857


For all six machine-learning algorithms, we find that gold-standard classifiers—that is, the models derived from complete data—consistently outperform their complete-case-analysis and imputation counterparts. This is logical, because useful information is invariably lost when we introduce missingness in the datasets. In contrast, complete-case analysis often leads to inferior performance: the AUCs of classifiers trained on complete-case training sets tend to be smaller than those trained on imputed training sets. This suggests that imputation does indeed offer improved fit and predictive power over listwise deletion.

Overall, we find kNN imputation to be most compatible with our datasets. It provides the least biased imputations among all missing-data methods (see Supplementary Materials E). In particular, the combination of kNN imputation (k = 5) with RF gives one of the highest gold-standard testing-set AUCs (0.81). We note a few other MI combinations that yield comparable or marginally better performance, but focus on the 5NN-RF approach in subsequent analyses on the main datasets due to its ease of implementation and application.

We find that SVM has the worst performance among all machine-learning models. This is not surprising, because SVMs are aimed only at learning binary classifiers and do not generally produce good class-probability estimates. Consequently, such models do not necessarily give high AUCs.

We also compare our approach with the ANDI algorithm (DiMasi et al., 2015) by applying a modified version of the index to oncology drugs in the gold-standard testing sets (see Supplementary Materials H for a more in-depth description). We find that our 5NN-RF model achieves a significantly higher AUC than the modified ANDI, with an average improvement of 0.1 in AUC over 100 simulations (see Fig 7). We believe that this gain can be attributed to a larger training set with a wider range of features, a nonlinear model that can capture the complex relationships in the data, and a proper model-validation methodology.

Fig 7. Distributions of AUC of 5NN-RF and the modified ANDI on oncology-only gold-standard testing sets.

Predicting approvals

We analyze the two datasets (P2APP and P3APP) by first splitting each randomly into a training set (70%) and a testing set (30%) (pipeline drug-indication pairs are omitted, since their outcomes have yet to be determined). Subsequently, we train 5NN-RF models for each scenario according to the methodology outlined in Section 2. We repeat this experiment 100 times for robustness. Table 9 summarizes the AUC performance metrics for the testing sets. On average, we achieve 0.78 AUC for P2APP and 0.81 AUC for P3APP.


Table 9. Comparison of the general and indication-group-specific classifiers for selected indication groups. Abbreviations: Avg: average; Sd: standard deviation; 5%: 5th percentile; 50%: median; 95%: 95th percentile.

                         General classifier                  Specialized classifiers
                   Avg     Sd     5%     50%    95%      Avg     Sd     5%     50%    95%
P2APP
  All             0.777  0.017  0.749  0.775  0.806       -      -      -      -      -
  Anti-cancer     0.805  0.025  0.764  0.805  0.847     0.818  0.029  0.773  0.819  0.865
  Rare diseases   0.800  0.028  0.756  0.800  0.848     0.775  0.036  0.715  0.777  0.838
  Neurological    0.767  0.036  0.710  0.769  0.819     0.778  0.039  0.721  0.779  0.834
  Alimentary      0.749  0.045  0.672  0.751  0.817     0.732  0.048  0.651  0.734  0.807
  Immunological   0.783  0.065  0.665  0.786  0.889     0.766  0.069  0.646  0.775  0.860
  Anti-infective  0.735  0.043  0.673  0.736  0.800     0.750  0.047  0.684  0.746  0.832
  Respiratory     0.756  0.055  0.648  0.764  0.835     0.867  0.043  0.794  0.872  0.921
  Musculoskeletal 0.822  0.049  0.736  0.821  0.899     0.731  0.076  0.614  0.745  0.849
  Cardiovascular  0.709  0.072  0.580  0.711  0.812     0.694  0.073  0.579  0.698  0.807
  Genitourinary   0.633  0.086  0.503  0.634  0.790     0.706  0.091  0.552  0.710  0.840
P3APP
  All             0.810  0.018  0.781  0.810  0.834       -      -      -      -      -
  Anti-cancer     0.783  0.047  0.699  0.779  0.853     0.707  0.054  0.612  0.714  0.786
  Rare diseases   0.819  0.054  0.727  0.822  0.896     0.786  0.058  0.687  0.793  0.875
  Neurological    0.796  0.037  0.734  0.794  0.857     0.789  0.038  0.741  0.787  0.853
  Alimentary      0.817  0.047  0.744  0.820  0.891     0.805  0.054  0.718  0.808  0.888
  Immunological   0.811  0.074  0.680  0.815  0.910     0.757  0.099  0.586  0.765  0.892
  Anti-infective  0.757  0.065  0.644  0.752  0.854     0.708  0.068  0.600  0.707  0.808
  Respiratory     0.823  0.065  0.712  0.831  0.920     0.773  0.083  0.627  0.784  0.907
  Musculoskeletal 0.741  0.095  0.576  0.747  0.866     0.763  0.072  0.646  0.762  0.882
  Cardiovascular  0.794  0.058  0.702  0.788  0.887     0.755  0.076  0.639  0.765  0.864
  Genitourinary   0.814  0.083  0.670  0.821  0.937     0.801  0.090  0.635  0.808  0.927

The observed performance is essentially the MAR testing-set AUC, since backfilling has already affected the datasets used. In Supplementary Materials E, we highlight the perils of relying on the MAR testing set for model validation, and suggest that the AUCs for the gold-standard and MCAR testing sets are more reflective of a classifier's real performance. Unfortunately, we have access to neither the gold-standard nor the MCAR testing sets, because we do not know the true, underlying values of the missing features. However, our experiments indicate that the AUCs for the MAR and MCAR testing sets of the 5NN-RF combination are very close (a difference of 0.002 on average). This means that we may use the former, the only observed figure, as a reasonable estimate of the latter, which reflects real performance.


Next, we train classifiers based on the union of the training and testing sets, and use them to generate predictions for pipeline drug-indication pairs. We generate predictions for P2APP using only information from phase 2 trials, and for P3APP using only information from phase 3 trials. While we cannot compute AUC scores for these samples because their outcomes are still pending, we can compare their prediction scores with their development statuses at the time of this writing. These pipeline drug-indication pairs may still be in the same clinical stage (no change, i.e., phase 2 for P2APP, phase 3 for P3APP), be terminated (failed), or have progressed to higher phases (advanced).

Fig 8, Table 10, and Table 11 summarize the distributions of pipeline prediction scores. We find that pairs that fail generally have lower scores than those that advance to later phases of development. In Fig 8, we observe peaks at the lower end of the score spectrum for failed pairs (red) for both datasets. In contrast, pairs that advance tend to have peaks at higher scores (green). We observe the same patterns when we disaggregate the distributions by indication groups: the green parts tend to cluster above the distribution median, while the red parts cluster below. However, there are also some indication groups for which there are too few samples to make any useful remarks (e.g., hormonal products in P2APP). From Table 10, we see that the average scores of failed pairs are indeed lower than those of pairs that advance (differences ranging from 0.05 to 0.15). In Table 11, we bin drug-indication pairs that have new developments (whether failure or advancement) into four groupings, depending on their prediction scores. For each bin, we compute the proportion of samples that advance to later development stages. We find that the proportions generally increase with the score magnitude, suggesting that pairs with higher scores are more likely to advance than those with lower scores. We note that progress to later clinical stages does not always lead to approval. However, the results are still promising, because advancement is a necessary condition for approval. Our experiments indicate that our trained classifiers are able to discriminate between high- and low-potential candidates.
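Conceptually, the binning in Table 11 amounts to grouping scored pairs into intervals and computing the fraction that advanced within each; a sketch with placeholder data might look like this:

```python
# Sketch of the score binning behind Table 11; scores and outcomes are
# made-up placeholders, and the bin edges mirror the P2APP groupings.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": np.random.default_rng(0).uniform(0, 0.4, size=200),
    "advanced": np.random.default_rng(1).integers(0, 2, size=200),  # 1 = advanced
})

bins = [0, 0.1, 0.2, 0.3, np.inf]
labels = ["<0.1", "0.1-0.2", "0.2-0.3", ">=0.3"]
df["bin"] = pd.cut(df["score"], bins=bins, labels=labels, right=False)

summary = df.groupby("bin", observed=True)["advanced"].agg(n="size", proportion="mean")
print(summary)
```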

Table 10. Distributions of prediction scores for all indication groups in aggregate (see Fig 8). Advanced refers to progress to a higher phase from the original phase. The original phase for P2APP is phase 2; for P3APP it is phase 3. For instance, out of 1,511 drug-indication pairs in the P2APP testing set, 859 pairs are still pending decision in phase 2, 244 pairs have failed, and 408 pairs have successfully advanced to phase 3 testing. Abbreviations: n: sample size; Avg: average; Sd: standard deviation; 5%: 5th percentile; 50%: median; 95%: 95th percentile.

Prediction scores      n      Avg     Sd      5%      50%     95%
P2APP
  Aggregate          1,511   0.153   0.061   0.044   0.155   0.258
  No change            859   0.143   0.060   0.041   0.147   0.246
  Failed               244   0.137   0.061   0.034   0.147   0.240
  Advanced             408   0.183   0.056   0.093   0.178   0.274
P3APP
  Aggregate            252   0.417   0.189   0.128   0.402   0.695
  No change            142   0.392   0.185   0.129   0.384   0.693
  Failed                32   0.348   0.185   0.100   0.344   0.656
  Advanced              78   0.492   0.176   0.233   0.492   0.699

Table 11. Distributions of prediction scores for all indication groups in aggregate (see Fig 8). Proportion refers to the fraction of samples that advanced to a later phase from the original phase. Abbreviations: n: sample size.

Scores        n     Proportion
P2APP
  <0.1       108      0.231
  0.1-0.2    368      0.671
  0.2-0.3    171      0.766
  ≥0.3         5      1.000
P3APP
  <0.2        13      0.308
  0.2-0.4     35      0.686
  0.4-0.6     27      0.667
  ≥0.6        35      0.914


Fig 8. Distributions of prediction scores for P2APP and P3APP. First row: all indication groups in aggregate. Subsequent rows: specific indication groups.


To gain insight into the logic of our trained predictive models, we compute the average importance of the features used in the 5NN-RF classifiers over all the experiments, and extract the top ten most informative variables. The RF classifier (Pedregosa et al., 2011) we used computes the importance of a variable by finding the decrease in node impurity for all nodes that split on that variable, weighted by the probability of reaching that node (as estimated by the proportion of samples reaching that node), averaged over all trees in the forest ensemble (Breiman, Friedman, Stone, & Olshen, 1984). Table 12 summarizes the results.
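For concreteness, the sketch below shows how such impurity-based importances can be read off a fitted scikit-learn random forest; the data are synthetic and the feature names are placeholders:

```python
# Sketch of extracting impurity-based feature importances from a random forest.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ reflects the impurity decrease at splits on each feature,
# weighted by the fraction of samples reaching the split and averaged over trees
importances = (pd.Series(clf.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances.head(10))   # top ten most informative variables
```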

Table 12. Top ten important variables of the 5NN-RF classifiers for P2APP and P3APP. Average and standard deviation taken across all experiments. Abbreviations: Avg: average; Sd: standard deviation.

Importance                                                                        Avg     Sd
P2APP
  Trial outcome – completed, positive outcome, or primary endpoint(s) met        0.234   0.043
  Trial status                                                                   0.160   0.026
  Medium – solution                                                              0.051   0.018
  Actual accrual                                                                 0.046   0.010
  Sponsor type – industry, all other pharma                                      0.025   0.008
  Sponsor track record – number of positive phase 3 trials                       0.023   0.006
  Sponsor track record – number of failed drug-indication pairs                  0.021   0.007
  Study design – placebo control                                                 0.019   0.009
  Target accrual                                                                 0.018   0.005
  Prior approval of drug for another indication                                  0.018   0.007
P3APP
  Trial outcome – completed, positive outcome, or primary endpoint(s) met        0.357   0.028
  Trial status                                                                   0.148   0.014
  Duration                                                                       0.099   0.016
  Trial outcome – terminated, lack of efficacy                                   0.033   0.010
  Trial outcome – completed, negative outcome, or primary endpoint(s) not met    0.033   0.008
  Therapeutic area – oncology                                                    0.030   0.009
  Prior approval of drug for another indication                                  0.021   0.007
  Actual accrual                                                                 0.015   0.003
  Medium – powder                                                                0.014   0.007
  Medium – solution                                                              0.012   0.006

We find that trial outcome (whether the trial was completed with its primary endpoints met) and trial status (whether the trial was completed or terminated) have significant associations with success. These two features were consistently ranked the top two out of all variables and across both datasets. It is easy to imagine that a drug-indication pair whose trials were terminated has a low probability of success in terms of advancing from phases 2/3 to approval. In contrast, candidates that achieve positive outcomes certainly have a better shot at success. We also observe that prior approval of a drug has an effect on success for new indications or patient segmentation. It is plausible that developing an approved drug for a new indication has a greater likelihood of success than a new candidate.

In addition, trial characteristics such as accrual, duration, and the number of identified sites frequently appear in the top ten important variables. There are several possible explanations. For example, trials that end quickly without achieving primary endpoints may undermine the likelihood of success, and drugs with trials that have small accrual—and thus low statistical power—may have a lower probability of being approved.

We also find sponsor track records—quantified by the number of past successful trials (trials that achieve positive results or meet primary endpoints)—to be a useful factor for prediction. This factor has not been considered in previous related studies, but the intuition for its predictive power is clear: strong track records are likely associated with greater expertise in drug development.

Since drugs developed for different indication groups may have very different characteristics, we might expect classifiers trained on indication-group-specific data to outperform general classifiers. We build and analyze such specialized classifiers by filtering the datasets by indication group before performing the experiment described in the previous section. As a comparison, we also break down the performance of the general classifiers by indication group. Table 9 shows the results for selected indication groups. In general, we find specialized models to give poorer performance than general models. This is likely because the former are trained on less data, which makes them less accurate and more susceptible to overfitting.

We note that the approach adopted in this section—splitting drug-indication pairs into training and testing sets randomly without considering the dates of development—may be less than ideal because of look-ahead bias. For example, if the results of a 2008 trial are included in the training set for predicting the outcome of a 2004 development path for a drug-indication pair, our model will be using future information during validation, which can yield misleading and impractical inferences. To address this issue, in the next section we apply our machine-learning framework to time-series data using rolling windows that account for temporal ordering in the construction of training and testing sets. Although this process makes use of less data within each estimation window than when the entire dataset is used, it minimizes the impact of look-ahead bias and yields more realistic inferences. We study the effects of random splitting versus temporal ordering in Supplementary Materials J.

Predictions over time

Drug development has changed substantially over time, thanks to new scientific discoveries and technological improvements. To reflect these changes in our predictive analytics, we adopt a time-series, walk-forward approach to create training and testing sets for each of the two datasets, P2APP and P3APP (see Fig 9). We sample five-year rolling windows between 2004 and 2014 from each dataset. Each window consists of a training set of drug-indication pairs whose outcomes become finalized within the window, and an out-of-sample, out-of-time testing set of drug-indication pairs that ended phase 2 or phase 3 testing but are still in the pipeline with undetermined outcomes within the window. For example, consider the P2APP dataset. We draw the first window from 2004–2008, train our algorithm on drug-indication pairs that failed or were approved within this period as the training set, and apply the trained model to predict the outcomes of drug-indication pairs that just ended phase 2 testing within the same window as the testing set.

Fig 9. Time-series walk-forward analysis approach. The testing set in the last window (green) comprises drug-indication pairs in the pipeline at the time of the snapshot of the databases.
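A sketch of this window construction, with hypothetical column names standing in for the actual dataset fields, might look like the following:

```python
# Sketch of five-year rolling-window construction; outcome_year, phase_end_year,
# and the DataFrame itself are hypothetical stand-ins for the P2APP/P3APP data.
import pandas as pd

def make_window(df: pd.DataFrame, start: int, end: int):
    """Training set: pairs whose outcomes were finalized within [start, end].
    Testing set: pairs that ended phase 2/3 testing in the window but whose
    outcomes were still undetermined at the end of the window."""
    train = df[(df["outcome_year"] >= start) & (df["outcome_year"] <= end)]
    test = df[(df["phase_end_year"] >= start) & (df["phase_end_year"] <= end)
              & (df["outcome_year"].isna() | (df["outcome_year"] > end))]
    return train, test

# eight overlapping five-year windows: 2004-2008, 2005-2009, ..., 2011-2015
windows = [(y, y + 4) for y in range(2004, 2012)]
```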

We evaluate the resulting classifier by comparing its predictions with outcomes that are realized in the future (2009–2015). This rolling-window approach yields a total of eight overlapping training and testing periods, and a new 5NN-RF model is trained for each period. The eighth testing period consists of drug-indication pairs in the pipeline at the time of snapshot of the databases. Unlike the first seven periods, the outcomes of these pairs are still pending, and therefore we cannot compute a testing AUC for this window. However, we can examine the predictions and compare the scores with the candidates' development statuses at the time of this writing.
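The walk-forward construction can be sketched as follows. The column names (outcome_year, the year an approval or failure became final, missing while the outcome is pending; phase_end_year, the year the pair ended phase 2 or phase 3 testing) are illustrative placeholders for how the data are organized rather than the exact field names in our databases.

    import pandas as pd

    def walk_forward_windows(df, first_start=2004, last_start=2011, width=5):
        """Yield (window, train, test) splits for overlapping five-year windows.

        Training set: pairs whose outcomes became final within the window.
        Testing set:  pairs that ended phase testing within the window but whose
                      outcomes were still undetermined within the window.
        """
        for start in range(first_start, last_start + 1):
            end = start + width - 1
            train = df[df["outcome_year"].between(start, end)]
            test = df[df["phase_end_year"].between(start, end)
                      & (df["outcome_year"].isna() | (df["outcome_year"] > end))]
            yield (start, end), train, test

With first_start=2004 and last_start=2011, this yields the eight overlapping windows described above, the last being 2011–2015.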


Fig 10 summarizes the results of the time-series analysis for the first seven windows. We observe an increasing trend over the years for both P2APP (0.67 in the first and 0.80 in the last window) and P3APP (0.77 in the first and 0.88 in the last window). Interestingly, we note that the proportions of complete cases in the training sets correlate well with the time-series AUC (correlation coefficient 0.95 for P2APP and 0.90 for P3APP). We compute the proportion of complete cases by taking the number of feature-vector rows with complete information over the total number of rows. As is apparent from Fig 10, the proportions have been increasing over the years for both datasets. This is likely due to better data-reporting practices by drug developers, a possible consequence of FDAAA.
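The complete-case proportion is simply the fraction of rows with no missing entries, and the reported association is an ordinary Pearson correlation between the per-window AUC and complete-case series; a minimal sketch:

    import pandas as pd

    def complete_case_proportion(X):
        """Fraction of feature-vector rows with no missing values."""
        return float(pd.DataFrame(X).notna().all(axis=1).mean())

    # auc_by_window and cc_by_window would be collected inside the walk-forward loop:
    # corr = pd.Series(auc_by_window).corr(pd.Series(cc_by_window))  # Pearson by default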

Fig 10. Time-series walk-forward analysis for P2APP and P3APP using 5NN-RF. We use bootstrapping to determine the 95% CI for AUC (dotted lines). The dashed lines plot the corresponding proportions of complete cases in the training sets of each five-year window. Abbreviations: CC: proportion of complete cases.
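The bootstrapped confidence intervals can be obtained along the following lines; this is a minimal percentile-bootstrap sketch, and the number of resamples shown is illustrative rather than the exact setting used for the figures.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
        """Point estimate and percentile-bootstrap 95% CI for the AUC."""
        rng = np.random.default_rng(seed)
        y_true = np.asarray(y_true)
        y_score = np.asarray(y_score)
        n = len(y_true)
        stats = []
        while len(stats) < n_boot:
            idx = rng.integers(0, n, n)          # resample with replacement
            if len(np.unique(y_true[idx])) < 2:  # need both classes to compute AUC
                continue
            stats.append(roc_auc_score(y_true[idx], y_score[idx]))
        lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return roc_auc_score(y_true, y_score), (lo, hi)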


Next, we examine the 2011–2015 window. Fig 11, Table 13, and Table 14 summarize the distributions of prediction scores for the P2APP and P3APP datasets. We observe very similar patterns to the static pipeline predictions above. The histograms, average scores, and binning of samples indicate that pairs that fail tend to have lower prediction scores than those that advance. This shows that our classifiers are indeed able to differentiate successful candidates from unsuccessful ones.

Table 13. Distributions of prediction scores for all indication groups in aggregate (see Fig 11). Advanced refers to progress to a higher phase from the original phase. The original phase for P2APP is phase 2; for P3APP, phase 3. Abbreviations: n: sample size; Avg: average; Sd: standard deviation; 5%: 5th percentile; 50%: median; 95%: 95th percentile.

Prediction scores      n      Avg    Sd     5%     50%    95%
P2APP
  Aggregate            1,190  0.158  0.080  0.036  0.173  0.290
  No change            712    0.148  0.080  0.035  0.158  0.275
  Failed               195    0.143  0.079  0.034  0.149  0.255
  Advanced             283    0.197  0.071  0.068  0.200  0.323
P3APP
  Aggregate            218    0.431  0.211  0.113  0.476  0.689
  No change            121    0.395  0.207  0.113  0.403  0.684
  Failed               28     0.362  0.211  0.093  0.335  0.640
  Advanced             69     0.521  0.193  0.149  0.631  0.707
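Summaries of this kind can be produced directly from the score vector and outcome labels; a sketch, assuming hypothetical columns score and category (no change, failed, or advanced):

    import pandas as pd

    def score_distribution(df):
        """Per-category summary of prediction scores (count, mean, sd, 5%, 50%, 95%)."""
        stats = df.groupby("category")["score"].describe(percentiles=[0.05, 0.5, 0.95])
        return stats[["count", "mean", "std", "5%", "50%", "95%"]]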

Table 14. Distribution of prediction scores for all indication groups in aggregate (see Fig 11). Proportion refers to the fraction of samples that advanced to a higher phase from the original phase. Abbreviations: n: sample size.

Scores       n    Proportion
P2APP
  <0.1       99   0.313
  0.1–0.2    183  0.607
  0.2–0.3    168  0.690
  ≥0.3       28   0.893
P3APP
  <0.2       17   0.412
  0.2–0.4    17   0.706
  0.4–0.6    17   0.647
  ≥0.6       46   0.848
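Table 14 bins the same scores and reports the fraction of samples in each bin that advanced; a minimal sketch, using the bin edges shown in the table (0.1/0.2/0.3 for P2APP and 0.2/0.4/0.6 for P3APP) and an indicator column for advancement:

    import numpy as np
    import pandas as pd

    def bin_advance_rates(scores, advanced, edges):
        """Sample count and fraction advanced within each prediction-score bin."""
        df = pd.DataFrame({"score": scores, "advanced": advanced})
        bins = [-np.inf] + list(edges) + [np.inf]
        df["bin"] = pd.cut(df["score"], bins=bins, right=False)
        return df.groupby("bin", observed=True)["advanced"].agg(n="size", proportion="mean")

    # Example usage: bin_advance_rates(p2_scores, p2_advanced, edges=[0.1, 0.2, 0.3])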


Fig 11. Distributions of prediction scores of the 2011–2015 window testing set for P2APP and P3APP. The first row shows all indication groups in aggregate; subsequent rows show specific indication groups.


Table 15 summarizes the top ten most informative variables in the 5NN-RF classifiers over the eight rolling windows. We find them to be largely consistent with those observed in the static case: the trial outcome and trial status are significantly associated with success, and trial characteristics (such as accrual, duration, and number of identified sites), sponsor track record, and drug medium appear frequently in both scenarios.

Table 15. Top ten important variables in the 5NN-RF classifiers for P2APP and P3APP. Average and standard deviation taken across the eight rolling windows. Abbreviations: Avg: average; Sd: standard deviation.

Importance                                                                      Avg    Sd
P2APP
  Trial outcome – completed, positive outcome, or primary endpoint(s) met       0.203  0.083
  Trial status                                                                  0.102  0.033
  Prior approval of drug for another indication                                 0.077  0.061
  Actual accrual                                                                0.039  0.015
  Target accrual                                                                0.031  0.010
  Duration                                                                      0.027  0.014
  Sponsor track record – number of completed phase 3 trials                     0.025  0.007
  Medium – suspension                                                           0.024  0.018
  Sponsor type – academic                                                       0.023  0.017
  Medium – solution                                                             0.021  0.019
P3APP
  Trial outcome – completed, positive outcome, or primary endpoint(s) met       0.348  0.028
  Trial status                                                                  0.125  0.020
  Duration                                                                      0.053  0.017
  Prior approval of drug for another indication                                 0.046  0.028
  Trial outcome – completed, negative outcome, or primary endpoint(s) not met   0.033  0.026
  Target accrual                                                                0.021  0.005
  Trial outcome – terminated, lack of efficacy                                  0.020  0.013
  Actual accrual                                                                0.019  0.004
  Therapeutic area – oncology                                                   0.017  0.013
  Number of identified sites                                                    0.012  0.002
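Averaging the random forest's impurity-based importances across the eight window models is enough to reproduce a Table 15 style ranking; a sketch, assuming models holds the fitted per-window classifiers and feature_names the corresponding feature labels:

    import pandas as pd

    def mean_importance(models, feature_names, top=10):
        """Average and standard deviation of RF feature importances across windows."""
        imp = pd.DataFrame([m.feature_importances_ for m in models],
                           columns=feature_names)
        summary = pd.DataFrame({"Avg": imp.mean(), "Sd": imp.std()})
        return summary.sort_values("Avg", ascending=False).head(top)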

As in the static case, we also train indication-group-specific classifiers using rolling windows. Table 16 and Table 17 summarize the results for selected indication groups in P2APP and P3APP, respectively (see Supplementary Materials G for the results of all other indication groups).


Indication groups with small sample sizes tend to produce poor and unstable specialized classifiers (e.g., the musculoskeletal indication group in P2APP). This is expected because models trained on small training sets are more susceptible to overfitting, especially when non-linear algorithms such as RF are used. In contrast, indication groups with larger sample sizes tend to give rise to rather good classifiers (e.g., anti-cancer in P2APP).

Table 16. Comparison of the general and indication-group-specific classifiers for selected indication groups in P2APP. We use bootstrapping to determine the 95% CI for AUC.

                  General classifier                        Specialized classifier
Window            Train set  Test set  AUC [95% CI]         Train set  Test set  AUC [95% CI]
All
  2004–2008       1,361      551       0.669 [0.614, 0.725] -          -         -
  2005–2009       1,562      591       0.680 [0.625, 0.735] -          -         -
  2006–2010       1,764      636       0.712 [0.668, 0.755] -          -         -
  2007–2011       1,969      598       0.738 [0.698, 0.777] -          -         -
  2008–2012       2,082      597       0.799 [0.760, 0.837] -          -         -
  2009–2013       2,212      517       0.823 [0.779, 0.867] -          -         -
  2010–2014       2,289      380       0.797 [0.718, 0.876] -          -         -
Anti-cancer
  2004–2008       1,361      137       0.665 [0.528, 0.803] 456        137       0.683 [0.533, 0.833]
  2005–2009       1,562      163       0.739 [0.618, 0.861] 494        163       0.635 [0.512, 0.758]
  2006–2010       1,764      188       0.774 [0.702, 0.846] 546        188       0.726 [0.635, 0.816]
  2007–2011       1,969      193       0.830 [0.773, 0.887] 618        193       0.746 [0.661, 0.831]
  2008–2012       2,082      198       0.805 [0.717, 0.894] 682        198       0.760 [0.665, 0.855]
  2009–2013       2,212      177       0.852 [0.783, 0.922] 736        177       0.786 [0.696, 0.876]
  2010–2014       2,289      173       0.815 [0.691, 0.938] 791        173       0.803 [0.666, 0.940]
Musculoskeletal
  2004–2008       1,361      35        0.765 [0.597, 0.933] 96         35        0.704 [0.512, 0.896]
  2005–2009       1,562      38        0.716 [0.489, 0.944] 109        38        0.674 [0.472, 0.876]
  2006–2010       1,764      35        0.634 [0.439, 0.830] 111        35        0.509 [0.276, 0.742]
  2007–2011       1,969      37        0.737 [0.571, 0.903] 119        37        0.677 [0.493, 0.860]
  2008–2012       2,082      36        0.884 [0.773, 0.995] 127        36        0.683 [0.462, 0.904]
  2009–2013       2,212      26        0.792 [0.573, 1.000] 133        26        0.667 [0.429, 0.904]
  2010–2014       2,289      19        0.882 [0.724, 1.000] 128        19        0.882 [0.706, 1.000]


Table 17. Comparison of the general and indication-group-specific classifiers for selected indication groups in P3APP. We use bootstrapping to determine the 95% CI for AUC.

                  General classifier                        Specialized classifier
Window            Train set  Test set  AUC [95% CI]         Train set  Test set  AUC [95% CI]
All
  2004–2008       472        196       0.769 [0.704, 0.834] -          -         -
  2005–2009       559        177       0.724 [0.650, 0.798] -          -         -
  2006–2010       604        211       0.738 [0.671, 0.805] -          -         -
  2007–2011       664        174       0.806 [0.740, 0.871] -          -         -
  2008–2012       677        197       0.827 [0.768, 0.886] -          -         -
  2009–2013       740        153       0.868 [0.809, 0.927] -          -         -
  2010–2014       734        110       0.876 [0.811, 0.941] -          -         -
Anti-cancer
  2004–2008       472        34        0.773 [0.618, 0.928] 95         34        0.684 [0.495, 0.874]
  2005–2009       559        28        0.740 [0.543, 0.936] 107        28        0.568 [0.345, 0.791]
  2006–2010       604        50        0.754 [0.599, 0.910] 110        50        0.630 [0.452, 0.809]
  2007–2011       664        24        0.587 [0.333, 0.842] 132        24        0.392 [0.132, 0.651]
  2008–2012       677        40        0.793 [0.549, 1.000] 134        40        0.668 [0.457, 0.879]
  2009–2013       740        29        0.800 [0.480, 1.000] 151        29        0.775 [0.528, 1.000]
  2010–2014       734        26        0.943 [0.842, 1.000] 153        26        0.852 [0.558, 1.000]
Rare Diseases
  2004–2008       472        22        0.711 [0.465, 0.957] 54         22        0.620 [0.364, 0.876]
  2005–2009       559        23        0.735 [0.517, 0.952] 60         23        0.606 [0.360, 0.852]
  2006–2010       604        24        0.888 [0.747, 1.000] 66         24        0.825 [0.645, 1.000]
  2007–2011       664        22        0.838 [0.652, 1.000] 72         22        0.735 [0.520, 0.950]
  2008–2012       677        34        0.893 [0.780, 1.000] 76         34        0.700 [0.523, 0.877]
  2009–2013       740        28        0.962 [0.899, 1.000] 94         28        0.932 [0.840, 1.000]
  2010–2014       734        18        0.908 [0.766, 1.000] 109        18        0.985 [0.942, 1.000]

For comparison, we also disaggregate the performance of the general classifiers by indication group. We find that the general classifiers do not lose out to their specialized counterparts. In fact, our results show that they tend to exhibit more stable performance across the seven windows, particularly on indication groups with small sample sizes. We hypothesize that classifiers trained on all the data benefit from having access to larger datasets with greater diversity, and are thus able to make more informed predictions. This suggests that it may be more appropriate to rely on general classifiers, rather than specialized ones, for predictions over time, where samples are spread out over multiple windows and further filtering by indication group results in even smaller sample sizes.


Finally, we extract the top five P2APP pipeline drug candidates with the highest scores in each indication group as predicted by the 2011–2015 rolling-window model. Table 18 summarizes the results. We include only candidates that are still outstanding at the time of writing (neither discontinued nor approved). It is encouraging that many of these candidates (indicated in italics) have advanced beyond phase 2 testing since our analysis, indicating the predictive power of our models. We include an interactive version in Fig 12, where readers can filter our pipeline predictions by indication group and probability of approval. Ultimately, all biopharma stakeholders can use such scores to rank and evaluate the potential risks and rewards of drug candidates.
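Extracting a ranking like Table 18 from the pipeline predictions amounts to a filter and a per-group sort; a sketch, assuming hypothetical columns indication_group, drug, indication, score, and status:

    import pandas as pd

    def top_candidates(pipeline_df, k=5):
        """Top-k outstanding pipeline candidates per indication group by predicted score."""
        outstanding = pipeline_df[~pipeline_df["status"].isin(["approved", "discontinued"])]
        ranked = outstanding.sort_values("score", ascending=False)
        # head(k) after sorting keeps the k highest-scoring rows within each group
        return ranked.groupby("indication_group", group_keys=False).head(k)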

Fig 12. Network graph of P2APP pipeline drug candidates. Black nodes correspond to indication groups. Colored nodes correspond to drug-indication pairs. Each drug-indication pair node is connected to its parent indication group and to other drug-indication pairs that have the same indication. Nodes are colored according to their respective probability of approval as predicted by our model (blue for higher scores and red for lower scores). Hover over nodes for details of each drug-indication pair. Black indication group nodes are sized based on the number of connections.

The information presented through this research and in the included figure is made available solely for general informational purposes. The authors do not warrant the accuracy, completeness, or usefulness of this information. Any reliance you place on such information is strictly at your own risk. The authors expressly disclaim all liability and responsibility arising from any reliance placed on such information by you, or by anyone who may be informed of this information.


Table 18. Top five P2APP pipeline drug candidates with the highest scores in each indication group as predicted by our model. We include only candidates that are still outstanding at the time of writing (neither discontinued nor approved). Drug-indication pairs in italics are those that have advanced beyond phase 2 testing since our analysis.

Anti-cancer
  ontecizumab – Cancer, colorectal – 0.34
  calmangafodipir – Radio/chemotherapy-induced injury, bone marrow, neutropenia – 0.31
  tivantinib – Cancer, sarcoma, soft tissue – 0.30
  pidilizumab – Cancer, colorectal – 0.29
  NK-012 – Cancer, colorectal – 0.28
Musculoskeletal
  tofacitinib – Arthritis, psoriatic – 0.31
  ixekizumab – Arthritis, rheumatoid – 0.31
  anti-BLyS/APRIL antibody fusion protein – Arthritis, rheumatoid – 0.31
  sirukumab – Arthritis, rheumatoid – 0.29
  romosozumab – Osteoporosis – 0.28
Rare Diseases
  surotomycin – Infection, Clostridium difficile – 0.34
  tivantinib – Cancer, sarcoma, soft tissue – 0.30
  VP-20621 – Infection, Clostridium difficile prophylaxis – 0.30
  KHK-7580 – Secondary hyperparathyroidism – 0.29
  nitric oxide, inhaled – Hypertension, pulmonary – 0.29
Cardiovascular
  K-134 – Peripheral vascular disease – 0.37
  nitric oxide, inhaled – Hypertension, pulmonary – 0.29
  TY-51924 – Infarction, myocardial – 0.28
  s-amlodipine + telmisartan – Hypertension, unspecified – 0.27
  tirasemtiv – Peripheral vascular disease – 0.24
Neurological
  dasotraline – Attention deficit hyperactivity disorder – 0.35
  idalopirdine – Alzheimer's disease – 0.35
  GRC-17536 – Neuropathy, diabetic – 0.34
  caprylic triglyceride – Alzheimer's disease – 0.32
  levodopa – Parkinson's disease – 0.31
Dermatological
  tofacitinib – Arthritis, psoriatic – 0.31
  dimethyl fumarate – Psoriasis – 0.27
  pefcalcitol – Psoriasis – 0.24
  Benvitimod – Psoriasis – 0.22
  calcipotriol monohydrate + betamethasone dipropionate – Psoriasis – 0.22
Alimentary
  ibodutant – Irritable bowel syndrome, diarrhoea-predominant – 0.37
  GRC-17536 – Neuropathy, diabetic – 0.34
  mesalazine + N-acetylcysteine – Colitis, ulcerative – 0.31
  apabetalone (tablet) – Diabetes, Type 2 – 0.31
  phosphatidylcholine – Colitis, ulcerative – 0.31
Genitourinary
  etonogestrel + estradiol (vaginal ring), next generation – Contraceptive, female – 0.30
  drospirenone + estradiol – Contraceptive, female – 0.28
  finerenone – Nephropathy, diabetic – 0.27
  afacifenacin fumarate – Overactive bladder – 0.26
  GKT-137831 – Nephropathy, diabetic – 0.26
Immunological
  tofacitinib – Arthritis, psoriatic – 0.31
  ixekizumab – Arthritis, rheumatoid – 0.31
  anti-BLyS/APRIL antibody fusion protein – Arthritis, rheumatoid – 0.31
  sirukumab – Arthritis, rheumatoid – 0.29
  dimethyl fumarate – Psoriasis – 0.27
Blood and Clotting
  calmangafodipir – Radio/chemotherapy-induced injury, bone marrow, neutropenia – 0.31
  balugrastim – Radio/chemotherapy-induced injury, bone marrow, neutropenia – 0.27
  eflapegrastim – Radio/chemotherapy-induced injury, bone marrow, neutropenia – 0.25
  pegfilgrastim – Radio/chemotherapy-induced injury, bone marrow, neutropenia – 0.22
  lexaptepid pegol – Radio/chemotherapy-induced anaemia – 0.20
Anti-infective
  delafloxacin – Infection, skin and skin structure, acute bacterial – 0.39
  surotomycin – Infection, Clostridium difficile – 0.34
  delafloxacin – Infection, pneumonia, community-acquired – 0.33
  plazomicin – Infection, urinary tract, complicated – 0.33
  Ypeginterferon alpha-2b – Infection, hepatitis-C virus – 0.33
Sensory
  AR-13324 + latanoprost – Glaucoma – 0.27
  S-646240 – Macular degeneration, age-related, wet – 0.27
  netarsudil – Glaucoma – 0.26
  fenofibrate, micronized-2 – Oedema, macular, diabetic – 0.25
  LX-7101 – Glaucoma – 0.21
Respiratory
  fluticasone + salmeterol – Asthma – 0.36
  fluticasone furoate + umeclidinium + vilanterol – Chronic obstructive pulmonary disease – 0.36
  fluticasone furoate + umeclidinium – Chronic obstructive pulmonary disease – 0.36
  beclometasone + formoterol – Chronic obstructive pulmonary disease – 0.35
  fluticasone propionate DPI – Asthma – 0.35
Hormonal
  KHK-7580 – Secondary hyperparathyroidism – 0.29
  somatropin prodrug, pegylated – Growth hormone deficiency – 0.21
  2MD – Secondary hyperparathyroidism – 0.21
  velcalcetide – Secondary hyperparathyroidism – 0.19
  tesamorelin acetate – Growth hormone deficiency – 0.18


4 Discussion

Drug development is an extremely costly process, and the accurate evaluation of a candidate drug's likelihood of approval is critical to the efficient allocation of capital. Historical successes and failures contain valuable insights on the characteristics of high-potential candidates. Unfortunately, such data are often incomplete due to partial reporting by investigators and developers. Most analytic methods require complete data, however, and prior studies on estimating approval rates and predicting approvals are typically based on a small number of examples that have complete information for just a few features.

In this paper, we extract two datasets, P2APP and P3APP, from Informa® databases and apply 5NN statistical imputation to make efficient use of all available data. We use machine-learning techniques to train and validate our RF predictive models and achieve promising levels of predictive power for both datasets. When applied to pipeline drugs, we find that candidates with higher scores are indeed more likely to advance to higher clinical phases, indicating that our 5NN-RF classifiers are able to discriminate between high- and low-potential candidates.
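In current scikit-learn terms, the 5NN-RF idea corresponds roughly to k-nearest-neighbour imputation of missing feature values followed by a random-forest classifier. The sketch below is illustrative only; the hyperparameters are placeholders and the snippet is not necessarily identical to our original implementation.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import KNNImputer
    from sklearn.pipeline import make_pipeline

    # 5-nearest-neighbour imputation chained with a random forest
    model = make_pipeline(
        KNNImputer(n_neighbors=5),
        RandomForestClassifier(n_estimators=500, random_state=0),
    )
    # model.fit(X_train, y_train)
    # scores = model.predict_proba(X_test)[:, 1]  # predicted probability of approval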

A time-series analysis of the datasets shows generally increasing trends in performance over five-year rolling windows from 2004 to 2014. We find that the classifiers' performance correlates well with the proportions of complete cases in the training sets: as completeness increases, the classifier learns better and achieves higher AUCs. This highlights the importance of data quality in building more accurate predictive algorithms for drug development.


Finally, we compute feature importance in the predictive models and find that trial outcomes, trial status, trial accrual rates, duration, prior approval for another indication, and sponsor track records are the most critical features for predicting success. Because the 5NN-RF classifiers are non-linear, there is no simple interpretation of the incremental contribution of each predictor to the forecast. However, the intuition behind some of these factors is clear: drug-indication pairs with trials that achieve positive outcomes certainly have a better chance of approval; candidates sponsored by companies with strong track records and greater expertise in drug development should have a higher likelihood of success; and approved drugs may have higher chances of approval for a second, related indication. Many of these factors contain useful signals about drug-development outcomes but have not been considered in prior studies.

These results are promising and raise the possibility of even more powerful drug-development prediction models with access to better-quality data. This can be driven by programs such as Project Data Sphere (Green et al., 2015) and Vivli (Bierer, Li, Barnes, & Sim, 2016) that promote and facilitate public sharing of patient-level clinical-trial data. Ultimately, such predictive analytics can be used to make more informed, data-driven decisions in the risk assessment and portfolio management of investigational drugs at all clinical stages.


References

Bierer, B. E., Li, R., Barnes, M., & Sim, I. (2016). A global, neutral platform for sharing trial data. New England Journal of Medicine, 374(25), 2411–2413.

Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC Press.

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330.

DiMasi, J. A., Hermann, J. C., Twyman, K., Kondru, R. K., Stergiopoulos, S., Getz, K. A., & Rackoff, W. (2015). A tool for predicting regulatory approval after phase II testing of new oncology compounds. Clinical Pharmacology & Therapeutics, 98(5), 506–513.

El-Maraghi, R. H., & Eisenhauer, E. A. (2008). Review of phase II trial designs used in studies of molecular targeted agents: outcomes and predictors of success in phase III. Journal of Clinical Oncology, 26(8), 1346–1354.

Enders, C. K. (2010). Applied missing data analysis. Guilford Press.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.

Fernandez, J. M., Stein, R. M., & Lo, A. W. (2012). Commercializing biomedical research through securitization techniques. Nature Biotechnology, 30(10), 964–975.

Goffin, J., Baral, S., Tu, D., Nomikos, D., & Seymour, L. (2005). Objective responses in patients with malignant melanoma or renal cell cancer in early clinical studies do not predict regulatory approval. Clinical Cancer Research, 11(16), 5928–5934.

Green, A. K., Reeder-Hayes, K. E., Corty, R. W., Basch, E., Milowsky, M. I., Dusetzina, S. B., Bennett, A. V., & Wood, W. A. (2015). The Project Data Sphere initiative: accelerating cancer research by sharing data. The Oncologist, 20(5), 464-e20.

Informa - Pharmaceutical Clinical Trial Intelligence Products. (2016). Informa. Retrieved 5 December 2016, from https://pharmaintelligence.informa.com/products-and-services/data-and-analysis/citeline-joins-informas-pharma-intelligence

Jardim, D. L., Groves, E. S., Breitfeld, P. P., & Kurzrock, R. (2017). Factors associated with failure of oncology drugs in late-stage clinical development: A systematic review. Cancer Treatment Reviews, 52, 12–21.

Kuhn, M., Weston, S., Coulter, N., & Quinlan, R. (2014). C50: C5.0 decision trees and rule-based models. R package version 0.1.0-21.

Malik, L., Mejia, A., Parsons, H., Ehler, B., Mahalingam, D., Brenner, A., Sarantopoulos, J., & Weitman, S. (2014). Predicting success in regulatory approval from Phase I results. Cancer Chemotherapy and Pharmacology, 74(5), 1099–1103.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.


Scannell, J., Blanckley, A., Boldon, H., & Warrington, B. (2012). Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11, 191–200.

Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.

Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147–177.

Van Buuren, S. (2012). Flexible imputation of missing data. CRC Press.

Zarin, D. A., Tse, T., Williams, R. J., & Carr, S. (2016). Trial reporting in ClinicalTrials.gov - the final rule. New England Journal of Medicine, 375(20), 1998–2004.