machine learning with statistical imputation for ......approval dataset includes more than 6,000...

46
HDSR Issue 1 Machine learning with statistical imputation for predicting drug approvals Andrew W. Lo 1,2,3 *, Kien Wei Siah 1,2 , Chi Heem Wong 1,2 1 Laboratory for Financial Engineering, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America 2 EECS and CSAIL, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America 3 Santa Fe Institute, Santa Fe, New Mexico, United States of America * Corresponding author email: [email protected] (A.W.L.) Abstract We apply machine-learning techniques to predict drug approvals using drug-development and clinical-trial data from 2003 to 2015 involving several thousand drug-indication pairs with over 140 features across 15 disease groups. To deal with missing data, we use imputation methods that allow us to fully exploit the entire dataset, the largest of its kind. We show that our approach outperforms complete-case analysis, which typically yields biased inferences. We achieve predictive measures of 0.78, and 0.81 AUC (“area under the receiver operating characteristic curve,” the estimated probability that a classifier will rank a positive outcome higher than a negative outcome) for predicting transitions from phase 2 to approval and phase 3 to approval, respectively. Using five-year rolling windows, we document an increasing trend in the predictive power of these models, a consequence of improving data quality and quantity. The most important features for predicting success are trial outcomes, trial status, trial accrual rates, duration, prior approval for another indication, and sponsor track records. We provide estimates of the probability of success for all drugs in the current pipeline. Keywords: Drug development; Clinical trials; Healthcare; Machine learning; Imputation We thank Informa for providing us access to their data and expertise and are particularly grateful to Christine Blazynski, Mark Gordon, and Michael Hay for many helpful comments and discussion throughout this project. We also thank them and Anna Barker, Linda Blackerby, Lara Boro, Steve Finch, Howard Fingert, Xiao-Li Meng, Ellen Moore, Jim Reddoch, Lara Sullivan, Marty Tenenbaum, James Wade, and three anonymous referees for specific comments on this manuscript, and Jayna Cummings and Paige Sammartino for editorial assistance. Research support from the MIT Laboratory for Financial Engineering is gratefully acknowledged. The views and opinions expressed in this article are those of the authors only, and do not necessarily represent the views and opinions of any institution or agency, any of their affiliates or employees, or any of the individuals acknowledged above.

Upload: others

Post on 22-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

Machine learning with statistical imputation forpredictingdrugapprovals‡AndrewW.Lo1,2,3*,KienWeiSiah1,2,ChiHeemWong1,21LaboratoryforFinancialEngineering,SloanSchoolofManagement,MassachusettsInstituteofTechnology,Cambridge,Massachusetts,UnitedStatesofAmerica2EECSandCSAIL,MassachusettsInstituteofTechnology,Cambridge,Massachusetts,UnitedStatesofAmerica3SantaFeInstitute,SantaFe,NewMexico,UnitedStatesofAmerica*Correspondingauthor

email:[email protected](A.W.L.)

AbstractWeapplymachine-learningtechniquestopredictdrugapprovalsusingdrug-developmentandclinical-trialdatafrom2003to2015involvingseveralthousanddrug-indicationpairswith over 140 features across 15 disease groups. To deal with missing data, we useimputationmethodsthatallowustofullyexploittheentiredataset,thelargestofitskind.We show that our approach outperforms complete-case analysis, which typically yieldsbiasedinferences.Weachievepredictivemeasuresof0.78,and0.81AUC(“areaunderthereceiveroperatingcharacteristiccurve,”theestimatedprobabilitythataclassifierwillrankapositiveoutcomehigherthananegativeoutcome)forpredictingtransitionsfromphase2to approval and phase 3 to approval, respectively. Using five-year rolling windows, wedocumentan increasingtrend inthepredictivepowerof thesemodels,aconsequenceofimprovingdataqualityandquantity.Themostimportantfeaturesforpredictingsuccessaretrialoutcomes,trialstatus,trialaccrualrates,duration,priorapprovalforanotherindication,andsponsortrackrecords.Weprovideestimatesoftheprobabilityofsuccessforalldrugsinthecurrentpipeline.

Keywords:Drugdevelopment;Clinicaltrials;Healthcare;Machinelearning;Imputation

‡WethankInformaforprovidingusaccesstotheirdataandexpertiseandareparticularlygratefultoChristineBlazynski,MarkGordon,andMichaelHayformanyhelpfulcommentsanddiscussionthroughoutthisproject.WealsothankthemandAnnaBarker,LindaBlackerby,LaraBoro,SteveFinch,HowardFingert,Xiao-LiMeng,EllenMoore,JimReddoch,LaraSullivan,MartyTenenbaum,JamesWade,andthreeanonymousrefereesforspecificcommentsonthismanuscript,andJaynaCummingsandPaigeSammartinoforeditorialassistance.ResearchsupportfromtheMITLaboratoryforFinancialEngineeringisgratefullyacknowledged.Theviewsandopinionsexpressedinthisarticlearethoseoftheauthorsonly,anddonotnecessarilyrepresenttheviewsand opinions of any institution or agency, any of their affiliates or employees, or any of the individualsacknowledgedabove.

Page 2: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1Author contributions: Conceptualization, data-license agreement, A.W.L.; Methodology,A.W.L.,K.W.S.andC.H.W.;Investigation,K.W.S.;Writing(firstdraft),K.W.S.;Writing(review,editing,finaldraft),K.W.S.andC.H.W.;Visualization,K.W.S.;Projectsupervision,A.W.L.

Page 3: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page1of44©2019byLo,Siah,andWong,AllRightsReserved

1 IntroductionWhilemanyrecentmedicalbreakthroughssuchasimmuno-therapies,genetherapies,and

gene-editing techniques offer new hope for patients, they have also made biomedical

innovationriskier,andmorecomplexandexpensive.Thesebreakthroughsgeneratenovel

therapiesforinvestigation,eachofwhichrequiresmanyyearsoftranslationalresearchand

clinicaltesting,costinghundredsofmillionstobillionsofdollarsandyetoftenfaceahigh

likelihoodoffailure(Fernandez,Stein,&Lo,2012).Infact,drugdevelopmentproductivity—

theratioofthenumberofnewdrugsapprovedtoR&Dspendingeachyear—hasdeclined

steadilyoverthepast50yearsdespitescientificandtechnicalprogress.Thisphenomenon,

whichScannell,Blanckley,Boldon, andWarrington (2012) termed “Eroom’sLaw,” as the

reverse of Moore’s Law, suggests that the cost of developing new drugs has doubled

approximatelyeverynineyearssincethe1950s. In the faceofmultipleuncertainties, the

needtoevaluatedrugcandidatesbetterandallocatecapitaltohigh-potentialopportunities

moreefficientlyhasonlyintensified.

Toaddresstheseneeds,inthisarticleweapplymachine-learningtechniquestopredictthe

outcomesofrandomizedclinicaltrials.Machinelearningisaninterdisciplinaryfieldfocused

on tackling pattern recognition problems and building predictive models tomake data-

driven decisions, which is well-suited for this context. Successful applications of these

techniqueshavealreadyrevolutionizedanumberofindustries(e.g.,advertising,marketing,

financeandinsurance,oilandgasexploration)andarepoisedforevengreaterimpactvia

autonomousvehicles,facial-recognitionauthentication,andgeneral-purposerobotics.

Page 4: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page2of44©2019byLo,Siah,andWong,AllRightsReserved

Drugdevelopershavealreadyappliedmachine-learningtoolstothediscoveryprocessvia

high-throughputscreeningofvastlibrariesofchemicalandbiologicalcompoundstoidentify

drug targets. However, in managing their portfolios of investigational drugs, biopharma

companies typically use unconditional estimates of regulatory approval rates based on

historically observed relative frequencies. Machine learning techniques yield conditional

estimates of success, conditioned on a host of predictive factors known to affect the

likelihood of approval, including drug compound characteristics, clinical trial design,

previoustrialoutcomes,andthesponsortrackrecord.Weshowthatthesefeaturescontain

usefulsignalsaboutdrugdevelopmentoutcomesthatwillallowustoforecasttheoutcome

ofpipelinedevelopmentsmoreaccurately.

Ourmethodologyandresultshaveseveralimplicationsforstakeholdersinthebiomedical

ecosystem.Moreaccurateforecastsofthelikelihoodofsuccessofclinicaltrialswillreduce

theuncertaintysurroundingdrugdevelopment,whichwillincreasetheamountofcapital

thatinvestorsanddrugdevelopersarewillingtoallocatetothisendeavor.Byextension,this

would lower the cost of capital and increase the efficiency of the allocation process.

Specifically, we predict the probability of success of drug candidates in two scenarios:

(1)advancing from phase 2 to regulatory approval and (2) from phase 3 to regulatory

approval(seeFig1).Investorsanddrugdevelopersmayusesuchpredictionstoevaluatethe

risksofdifferentinvestigationaldrugsatdifferentclinicalstages,providingthemwithmuch-

needed transparency. Greater risk transparency is one source of improved financial

efficiencybecauseitfacilitatesmoreaccuratematchingofinvestorriskpreferenceswiththe

risksofbiomedicalinvestmentopportunities.

Page 5: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page3of44©2019byLo,Siah,andWong,AllRightsReserved

Machine-learningmodels can also offer guidance to scientists, clinicians, and biopharma

professionalsas towhich factorsaremost important indeterminingclinical-trialsuccess,

suggesting ways to improve the drug development process and decelerate or reverse

Eroom’sLaw.

Policymakers and regulators would also benefit from machine-learning predictions,

particularlyfordrug-indicationpairsthatarepredictedtofailwithhighlikelihood—these

cases highlight themostdifficult challenges in biomedicine and underscore the need for

greatergovernmentandphilanthropicsupport.

Fig1.Predictivemodelsforassessingtheprobabilityofapprovalofdrugcandidatesintwoscenarios:(1)afterphase2testing,and(2)afterphase3testing.

Tothebestofourknowledge,ourstudyisthelargestofitskind.Weconstructtwodatasets,

one for each scenario, from two proprietary pharmaceutical pipeline databases,

Pharmaprojects and Trialtrove provided by Informa® (Informa, 2016). The phase-2-to-

approvaldatasetincludesmorethan6,000uniquedrugsfor288indicationsandover14,500

phase2trials,andthephase-3-to-approvaldatasetcontainsmorethan1,400uniquedrugs

for253indicationsandover4,500phase3trials.Thesedatacoverover15indicationgroups.

Incontrast,mostpublishedresearchondrugapprovalpredictionhaveverysmallsample

Page 6: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page4of44©2019byLo,Siah,andWong,AllRightsReserved

sizes,areconcentratedonspecifictherapeuticareas,andinvolveonlyoneorasmallnumber

of predictive factors: Malik et al. (2014) examined the trial objective responses of 88

anticanceragentsinphase1;Goffin,Baral,Tu,Nomikos,andSeymour(2005)studiedthe

tumorresponseratesof58cytotoxicagentsin100phase1trialsand46agentsin499phase

2trials;El-MaraghiandEisenhauer(2008)lookedattheobjectiveresponsesof19phase2

anticancerdrugs in89singleagent trials; Jardim,Groves,Breitfeld, andKurzrock (2017)

examinedtheresponseratesof80phase3oncologydrugstoidentifyfactorsassociatedwith

failures;andDiMasietal.(2015)analyzed62cancerdrugsandproposedanapprovednew

drug index (ANDI)algorithmwith four factors topredict approval for lead indications in

oncology after phase 2 testing (see SupplementaryMaterials H for a comparison of our

analysistotheirs).

Anotherkeydifferenceinourapproachisthatwedealwithmissingdatausingstatistical

imputation methods. We explore four common approaches to “missingness” and

demonstratetheiradvantagesanddisadvantagesoverdiscardingincompletecases.Withthe

FDA Amendments Act of 2007, drug and clinical trial data collection has been rapidly

expanding,butthesedataareoftensparse,andourdatasetisnoexception.Relatedstudies

(e.g.,DiMasietal.,2015)havetypicallyusedonlycomplete-caseobservations—discarding

clinicaltrialswithanymissinginformation—whichtypicallyeliminateslargeportionsofthe

dataandmayalsoleadtocertainbiases.

Weusemachine-learningtechniquestoformourpredictions,includingcross-validationfor

trainingandaheld-outtestingsetforperformanceevaluation,andusethestandard“area

under the receiver operating characteristic curve” (AUC) metric to measure model

Page 7: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page5of44©2019byLo,Siah,andWong,AllRightsReserved

performance(AUCistheestimatedprobabilitythataclassifierwillrankapositiveoutcome

higherthananegativeoutcome[Fawcett,2006]).WeachieveAUCsof0.78 forpredicting

phase2toapproval(95%confidenceinterval(CI):[0.75,0.81])and0.81forpredictingphase

3toapproval(95%CI:[0.78,0.83]).Atime-series,walk-forwardanalysisapproachshows

similarresults.Wealsoapplyourmodelstothecurrentdrugpipeline—thatis,alldrugsstill

indevelopmentasoftheendofourdataset—toidentifythecandidatesthathavethehighest

and lowestprobabilitiesofsuccess.Weexaminethe latestdevelopmentstatusesof these

pipelinedrug-indicationpairs—atrue“out-of-sample”experiment(validationondatanot

used inmodel building)—and find that candidateswith higher scores are, indeed,more

likelytoprogresstolaterclinicalstages.Thisindicatesthatourclassifiersdodiscriminate

betweenhigh-andlow-potentialcandidates.

2 MaterialsandmethodsDataThecommercialdatavendorInforma®offerstwodatabasesthatareusedinouranalysis:

Pharmaprojects,whichspecializesindruginformation,andTrialtrove,whichspecializesin

clinical trials information (Informa,2016).These twodatabasesaggregatedrugand trial

informationfromover30,000datasourcesinmorethan150countries,includingcompany

press releases, government drug databases (e.g., Drugs@FDA) and trial databases (e.g.,

Clinicaltrials.gov [Zarin, Tse, Williams, & Carr, 2016], Clinicaltrialsregister.eu [extracted

from EudraCT]), and scientific conferences and publications. Using these sources, we

constructtwodatasetsofdrug-indicationpairs:phase2toapproval(P2APP)andphase3to

approval(P3APP).WeextractclinicaltrialfeaturesfromTrialtrove,andaugmentthisdata

usingdrug features fromPharmaprojects.Applyingmachine-learningalgorithms to these

Page 8: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page6of44©2019byLo,Siah,andWong,AllRightsReserved

datasetsallowsustoestimate:(1)whetheradrug-indicationpairthathasconcludedphase

2testingwillbeapprovedeventually;and(2)whetherapair thathasconcludedphase3

testingwillbeapprovedeventually.DatacleaningproceduresareoutlinedinSupplementary

MaterialsA.

Weconsiderallindicationsassociatedwithaparticulardrug,asopposedtoonlythelead

indication.Weextract all features that could conceivably correlatewith the likelihoodof

success, from drug compound attributes (31 features from Pharmaprojects profiles) to

clinical trial characteristics (113 features fromTrialtrove). These features are defined in

Table1andSupplementaryMaterialsA.Ingeneral,eachdatasetmaybepartitionedintotwo

disjointsubsets:onewithsamplesthathaveknownoutcomes,andanotherwithsamples

thatarestillinthepipelineatthetimeofsnapshotofthedatabases(thatis,theoutcomes

areunknown).Toprovideintuitionforthecharacteristicsofthesamples,wedescribekey

summarystatisticsofeachsubset.

TheP2APPdatasetconsistsof6,344drug-indicationpairsthathaveendedphase2testing;

thatis,therearenophase2trialsinprogressorplannedinthedatabase.Thephase2trials

inthisdatasetrangefromAugust8,1990toDecember15,2015.Inoursample,4,812pairs

haveknownoutcomes,while1,532pairsarestillinthepipeline.Inthesubsetwithknown

outcomes, we define the development statuses of suspension, termination, and lack of

developmentas“failures”(86.8%),andregistrationandlaunchas“successes”orapprovals

(13.2%).TheP3APPdatasetconsistsof1,870pairsthathaveendedphase3testing,ofwhich

1,610pairshaveknownoutcomes,while260pairsarestillinthepipeline.Forthosepairs

withknownoutcomes,wedefine“failures”(59.1%)and“successes”(40.9%) inthesame

Page 9: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page7of44©2019byLo,Siah,andWong,AllRightsReserved

fashionas theP2APPdataset.Thephase3 trials inP3APPspan from January1,1988 to

November1,2015.ThesefiguresaresummarizedinTable2.Here,theuseofterms“success”

and“failure”isinthecontextofachievingapproval.Wenotethatourdefinitionof“failures”

canincludedrugdevelopmentprogramsthatareterminatedduetofactorsunrelatedtothe

performanceofthedrug(e.g.,marketconditions,businessdecisions).InSection3Results,

wefindthatthisoutcomevariablehassignificantassociationswithtrialperformanceand

otherfactors.

Thedatasetscover15indicationgroups:alimentary,anti-infective,anti-parasitic,bloodand

clotting, cardiovascular, dermatological, genitourinary, hormonal, immunological,

musculoskeletal,neurological,anti-cancer,rarediseases,respiratory,andsensoryproducts.

Anti-canceragentsmakeupthelargestsubgroupinP2APP,andthesecondlargestinP3APP

(seeTable3).Industry-sponsoredtrialsdominatebothdatasets(seeTable4).Inaggregate,

weobserveadecreasingtrendinsuccessratesoverfive-yearrollingwindowsfrom2003to

2015(seeFig2).

To the best of our knowledge, this sample is the largest of its kind. All prior published

researchinthisliteratureinvolvedfewerthan100drugsor500trials(DiMasietal.,2015;

El-Maraghi & Eisenhauer, 2008; Goffin et al., 2005; Malik et al., 2014). In addition, our

datasetscoveradiversesetofindicationgroups,asopposedtoasingleareasuchasoncology.

Page 10: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page8of44©2019byLo,Siah,andWong,AllRightsReserved

Table1.DescriptionofparentfeaturesextractedfromPharmaprojectsandTrialtrove.Someparentfeaturesaremulti-label(e.g.,atrialmaybetaggedwithUnitedStatesandUnitedKingdomsimultaneously).Wetransformallmulti-labelparentfeaturesintobinarychildfeatures(1or0).SeeSupplementaryMaterialsAforspecificexamplesofeachfeature.Notethatdrug-indicationpairsforthesamedrughavethesamedrugfeatures;drug-indicationpairsinvolvedinthesametrialhavethesametrialfeatures.

Description TypeDrugFeaturesRoute Routeofadministrationofthedrug,thepathbywhichthedrugistaken

intothebody.Multi-label

Origin Originoftheactiveingredientinthedrug. Multi-labelMedium Mediumofthedrug. Multi-labelBiologicaltargetfamily Familyofproteinsinthebodywhoseactivityismodifiedbythedrug,

resultinginaspecificeffect.Multi-label

Pharmacologicaltargetfamily

Mechanismofactionofthedrug,thebiochemicalinteractionthroughwhichthedrugproducesitspharmacologicaleffect.

Multi-label

Drug-indicationdevelopmentstatus

Currentphaseofdevelopmentofthedrugfortheindication. Binary

Priorapprovalofdrugforanotherindication

Approvalofthedrugforanotherindicationpriortotheindicationunderconsideration(specifictodrug-indicationpair).

Binary

TrialFeaturesDuration Durationofthetrial(fromreportedstartdatetoenddate)indays. ContinuousStudydesign Designofthetrial(keywords). Multi-labelSponsortype Sponsorsofthetrialgroupedbytypes. Multi-labelTherapeuticarea Therapeuticareastargetedbythetrial. Multi-labelTrialstatus Statusofthetrial. BinaryTrialoutcome Resultsofthetrial. Multi-labelTargetaccrual Targetpatientaccrualofthetrial. ContinuousActualaccrual Actualpatientaccrualofthetrial. ContinuousLocations Locationsofthetrialbycountry. Multi-labelNumberofidentifiedsites Numberofsiteswherethetrialwasconducted. ContinuousBiomarkerinvolvement Typeofbiomarkerinvolvementinthetrial. Multi-labelSponsortrackrecord Sponsor’ssuccessindevelopingotherdrugspriortothedrug-indication

pairunderconsideration.Continuous

Investigatorexperience Primaryinvestigator’ssuccessindevelopingotherdrugspriortothedrug-indicationpairunderconsideration.

Continuous

Page 11: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page9of44©2019byLo,Siah,andWong,AllRightsReserved

Table2.SamplesizesofP2APPandP3APPdatasets.Weconsiderphase2trialinformationinP2APPdatasetsandphase3trialinformationinP3APPdataset.

Counts Drug-indication

PairsPhase2/3Trials

UniqueDrugs

UniqueIndications

UniquePhase2/3Trials

P2APPSuccess 635 2,563 540 173 2,486Failure 4,177 10,328 2,779 263 9,722Pipeline 1,532 2,815 1,189 221 2,713Total 6,344 15,706 4,073 288 14,584

P3APPSuccess 659 1,830 572 171 1,801Failure 951 2,425 764 203 2,360Pipeline 260 494 240 120 480Total 1,870 4,749 1,451 253 4,552

Table 3. Breakdown of drug-indication pairs by indication groups. A drug-indication pairmay havemultipleindicationgrouptags.Forinstance,renalcanceristaggedasbothanti-cancerandrarediseaseinPharmaprojects.

Count P2APP P3APPAll 6,344 1,870Anti-cancer 2,239 409RareDiseases 1,105 259Neurological 1,069 444Alimentary 757 249Immunological 474 101Anti-infective 493 177Respiratory 428 134Musculoskeletal 394 121Cardiovascular 388 158Dermatological 254 45Genitourinary 210 85BloodandClotting 160 97Sensory 137 41Hormonal 17 4Anti-parasitic 8 0

Table4.Breakdownoftrialsbysponsortypes.Atrialmaybesponsoredbymorethanoneparty(e.g.,collaborationbetweenindustrydevelopersandacademia).

Counts P2APP P3APPAll 14,584 4,552OtherPharma 5,432 1,721Top20Pharma 5,322 2,369Academic 4,869 736Government 1,807 314CooperativeGroup 958 230NotforProfit 181 51Generic 52 54ContractResearchOrganization 41 17

Page 12: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page10of44©2019byLo,Siah,andWong,AllRightsReserved

Fig2.SuccessratesinP2APPandP3APPoverfive-yearrollingwindowsfrom2003-2015.

MissingdataPriortothe2007FDAAmendmentsAct(FDAAA),itwasnotuncommonforinvestigatorsto

release only partial information about pipeline drugs and clinical trials to protect trade

secretsorsimplybecausetherewasnoincentivetodomore.Eventoday,someinvestigators

still do not adhere to the FDAAA-mandated registration policy or submit adequate

registrations.Therefore,allhistoricaldrugdevelopmentdatabaseshavemissingdata.We

notethatthe“missingness”hereislargelyrelatedtothepost-studyreportingofclinicaltrial

dataasopposedtoin-trialdatamissingness(e.g.,censorshipofpaneldataduetopatients

terminatingtrialparticipationprematurely).Intheformercase,thedata(e.g.,trialduration,

trialoutcomes)areusuallyavailabletotheinvestigatorsbutmaynotbereleasedpublicly,

and are thus considered “missing” from our standpoint. Therefore, our dataset may be

Page 13: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page11of44©2019byLo,Siah,andWong,AllRightsReserved

consideredanapproximationof theactualdataavailable to theFDAand individualdrug

developers.

Fig3,Fig4,Table5,andTable6summarizethepatternsofmissingnessinourdataset(we

excludepipelinedrug-indicationpairsherebecausetheiroutcomesarestillpending).The

missingdatapatternsaremultivariate.Whenconditionedonthelatestlevelofdevelopment,

foranyindication,wefindthatsuccessfuldrugsgenerallyhavelowerlevelsofmissingness

comparedtofaileddrugs.Forinstance,intheP2APPdataset,61%offaileddrugshavean

unknownmedium,while only 15%of approved drugs aremissing this feature.We also

observe that completed trials tend tohavegreater levelsofmissingness than terminated

trials.Betweentwodatasets,wefindthattheP3APPdataset,whichfocusesonphase3drugs

andtrials,generallyhaslessmissingdataforbothdrugandtrialfeaturesthantheP2APP

datasetwhichfocusesonphase2drugsandtrials.Thisisexpectedsincephase3trialsare

primarilyusedtosupportregistrationfilings.

Page 14: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page12of44©2019byLo,Siah,andWong,AllRightsReserved

Fig3.Missingnesspatternsofdrugfeatures.Eachrowcorrespondstoauniquedrug.Featuresnotincludedinthefigurearecompleteanddonothavemissingvalues.Abbreviations:Devstatus:highestlevelofdevelopmentofadrugforanyindication;Pharma:pharmacologicaltargetfamily;Bio:biologicaltargetfamily.

Fig4.Missingnesspatternsoftrialfeatures.Eachrowcorrespondstoauniqueclinicaltrial.Featuresnotincludedinthefigurearecompleteanddonothavemissingvalues.Abbreviations:Devstatus:highestlevelofdevelopmentof a drug for any indication; Status: trial status; Idsi: numberof identified sites; Actacc: actual accrual;Dura:duration;Taracc:targetaccrual;Loc:locations;Dkw:trialstudydesignkeywords;Outcome:trialoutcomes.

Page 15: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page13of44©2019byLo,Siah,andWong,AllRightsReserved

Table5.Missingnessindrugfeatureswithrespecttouniquedrugs(seeFig3).Thecolumnheading“Unconditional”referstooverallmissingnesswithoutconditioningonoutcome.

Missingness Unconditional Success Failure

P2APPRoute 0.04 0.00 0.04Pharmacologicaltargetfamily 0.06 0.02 0.07Biologicaltargetfamily 0.32 0.27 0.32Medium 0.53 0.15 0.61

P3APPRoute 0.01 0.00 0.02Pharmacologicaltargetfamily 0.03 0.02 0.04Biologicaltargetfamily 0.27 0.24 0.30Medium 0.35 0.14 0.54

Table6.Missingnessintrialfeatureswithrespecttouniquetrials(seeFig4).Thecolumnheading“Unconditional”referstooverallmissingnesswithoutconditioningontrialstatus.

Missingness Unconditional Completion Termination

P2APPNumberofidentifiedsites 0.10 0.10 0.10Actualaccrual 0.12 0.10 0.22Duration 0.26 0.29 0.05Targetaccrual 0.37 0.42 0.09Locations 0.02 0.02 0.02Studydesignkeywords 0.22 0.24 0.10Trialoutcomes 0.63 0.73 0.11

P3APPNumberofidentifiedsites 0.10 0.09 0.12Actualaccrual 0.12 0.09 0.26Duration 0.17 0.19 0.06Targetaccrual 0.27 0.31 0.09Locations 0.01 0.01 0.02Studydesignkeywords 0.09 0.09 0.06Trialoutcomes 0.53 0.62 0.07

Mostrelatedstudiesdonotreporttheextentofmissingdataintheirsamples,presumably

becausesmallerdatasetswereused.DiMasietal.(2015)reportedmissingdataforsomeof

theirfactors,andaddresseditthroughlistwisedeletion—deletingallobservationswithany

missingfactors.Sincestatisticalestimatorsoftenrequirecompletedata,thisapproachisthe

simplestremedyformissingness.However,itgreatlyreducestheamountofdataavailable

anddecreasesthestatisticalpoweroftheresultingstatistics.Furthermore,listwisedeletion

isvalidonlyunderstrictandunrealisticassumptions(seebelow),andwhensuchconditions

Page 16: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page14of44©2019byLo,Siah,andWong,AllRightsReserved

areviolated,inferencesarebiased.Inthecurrentstudy,wemakeanefforttoincludeinour

analysis all observed examples, with or without complete features, through the use of

statisticalimputation.

Missingdatamaybeclassified intothreecategories(Rubin,1976):missingcompletelyat

random (MCAR), missing at random (MAR), and missing not-at-random (MNAR). MCAR

refers todatathataremissing forreasonsentirely independentof thedata;MARapplies

whenthemissingnesscanbefullyaccountedforbytheobservedvariables;andMNARrefers

tosituationswhenneitherMCARnorMARisappropriate,inwhichcasetheprobabilityof

missingnessisdependentonthevalueofanunobservedvariable(VanBuuren,2012).See

SupplementaryMaterialsBfortheprecisedefinitionsofeachtypeofmissingness.

IfthemissingnessisMCAR,theobservedsamplescanbeviewedasarandomsubsampleof

the dataset. Consequently, using listwise deletion should not introduce any bias. While

convenient, this assumption is rarely satisfied in practice. In most drug-development

databases,faileddrugsaremorelikelytohavemissingfeaturesthansuccessfuldrugs(see

Table5).Clearly,MCARdoesnothold.

Applyinglistwisedeletionwhenthemissingness isnotMCARcan leadtoseverelybiased

estimates.Moreover,giventhenatureofdrug-developmentreporting,alargeportionofthe

originaldatamaybediscardedifmanyvariableshavemissingvalues.Forthesereasons,the

listwise-deletionapproachadoptedbyDiMasietal.(2015)andothersislessthanideal.

Givenonlytheobserveddata,itisimpossibletotestforMARversusMNAR(Enders,2010).

However, our knowledge of the data-collection process suggests thatMAR is a plausible

startingpoint,andwehypothesizethatthemissingnessindrugandtrialfeaturesismainly

Page 17: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page15of44©2019byLo,Siah,andWong,AllRightsReserved

accountedforbydrugdevelopmentandtrialstatusesrespectively.OurobservationsinTable

5andTable6supportthisapproach,asthemissingnessproportionsforsomefeaturesdiffer

greatlydependingontheoutcome.

OurassumptionofMARisconsistentwiththedata-collectionmethodologyintheInforma®

databases.DrugprofilesarebuiltupovertimeinPharmaprojects.Asadrugadvancestolater

phases, information about its characteristics becomes more readily available because

investigators releasemore data about pipeline drugs after each phase of clinical testing.

Informa®inputsthisinformationintoitsdatabasesastheybecomeavailableinthepublic

domainorthroughprimaryresearch.Approveddrugsaremorelikelytohavemorecomplete

profiles, while information about failed drugs tends to stay stagnant because no further

studiesareconducted.ItisveryplausiblethattheMARnatureofourdatasetsisanartifact

ofdatacollection,andbyextension,soaresimilarpharmaceuticaldatasetsextractedfrom

thepublicdomainandmaintainedinthesamefashion.Originallyintendedtotrackdrugand

trialactivities,PharmaprojectsandTrialtrovearenotstructuredtokeeptrackofinformation

updatesovertimesincetherewasnouseforit.Withouttimestampsoftheupdates,weare

notabletoeliminatetheMARartifactfromourdatasets.

Inouranalysis,weimputethemissingdataunderthemoreplausibleMARassumptionto

obtain complete datasets. In contrast to listwise deletion,we fill inmissing values using

informationintheobservedvariables.Thisallowsustoutilizedatathatwouldotherwisebe

discarded.Thereafter,wecanapplyalltheusualstatisticalestimatorsto this imputation-

completeddata.

Page 18: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page16of44©2019byLo,Siah,andWong,AllRightsReserved

MethodsOur analysis consists of two parts thatwe perform in R version 3.2.3. First, we impute

missingvaluestogeneratecompletedatasets.Next,weapplyarangeofmachine-learning

algorithmstobuildpredictivemodelsbasedontheimputeddata.Illustrationofthespecific

componentsofouranalysisappearinFig5.

Fig5.Modelingmethodologyadoptedinthisstudy.Abbreviations:CV:cross-validation.

Weformulateourtwoscenariosassupervisedbipartiterankingproblems,wherethegoalis

topredict theoutcome—successor failure—ofadrug-indicationpairgivenasetof input

features.Initially,wespliteachdatasetintotrainingandtestingsets.Foreachscenario,we

trainvariousclassifiersbasedonthecorrespondingtrainingset,andcomputetheexpected

errorofourpredictivemodelsbytestingthemontheheld-outtestingset.

Wecreatefeaturematricesfromthedatasetsbyrepresentingdrugandtrialfeaturesforeach

drug-indicationpairasvectors(seeFig6).Drug-indicationpairsassociatedwithmultiple

trialsarerepresentedbythesamenumberoffeaturevectors,e.g.,apairwithtwotrialshas

tworows.WegiveaconcreteexampleinFig6.Considerthedrug-indicationpairAnaliptin-

diabetestype2intheP2APPdataset.Werepresentitusingtwovectorrowssinceithastwo

phase2trialsinTrialtrove.Notethatthefeaturematrixisincompleteduetomissingdrug

Page 19: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page17of44©2019byLo,Siah,andWong,AllRightsReserved

andtrialfeatures.Wealsoconstructacolumnvectoroflabels,whichcontainstheoutcomes

of the drug-indication pairs. Labels are not available for pipeline drug-indication pairs

because theyare still indevelopmentand theiroutcomesare stilluncertain,hence these

observationsarenotusedtotrainourclassifiers.However,withthetrainedclassifiers,we

cangeneratepredictionsforpipelinedata.

Wespliteachdataset(excludingpipelinedrugs-indicationpairs)intotwodisjointsets,one

trainingsetandonetestingset,andformfeaturematricesforbothaccordingtothedrug-

indicationpairsineachset.Thetestingsetsserveasout-of-sampledatasetstoevaluateour

models.Therefore,wemasktheiroutcomes(thatis,wetreatthemasunknown)andaccess

themonlyattheveryendtocheckourperformance.

Fig6.Featurematrixofdataset.Eachrowcorrespondstoafeaturevector;eachfeaturecorrespondstoanentryinthevector;eachvectorhasalengthof144sincewehave31drugand113trialfeatures.Featurevectorsofalldrug-indicationpairsinthedatasetformthefeaturematrixcollectively.TrialIDisauniquetrialidentifierinTrialtrove.

Page 20: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page18of44©2019byLo,Siah,andWong,AllRightsReserved

Todealwithmissingdatainbothtrainingandtestingsets,weconsiderlistwisedeletionand

four statistical imputation techniques commonly used in social science research and

biostatistics: unconditional mean imputation, k-nearest neighbor (kNN) imputation,

multipleimputation(MI),anddecision-treealgorithms(seeSupplementaryMaterialsCfor

details). We follow best practices of the missing-data literature by including as many

relevantauxiliaryvariablesaspossible,aswellasallvariablesusedinsubsequentmodels

(Collins,Schafer,&Kam,2001;Enders,2010;Rubin,1996;Schafer&Graham,2002).This

makestheassumptionofMARmoreplausibleinourdatasets,andhelpstoreducebiasin

subsequent analyses (Schafer, 1997). In particular, it is necessary to include our target

variable—the drug-indication development status—inour imputationmodel becausewe

hypothesize thatmissingness ismainly accounted for by it. This is not an issue for the

trainingsets.However,theoutcomesinthetestingsetsaremasked,andnotsupposedtobe

known.Therefore,wetreatthetestingsetoutcomesasthoughtheyweremissingandimpute

themtogetherwithalltheothermissingfeatures.Afterimputation,wediscardtheimputed

testing-setoutcomes,anduseonlythe imputedfeaturevalues forpredictions.Wedothe

samewhenevaluatingpipelinedatasets.

Withrespect to themachine-learningalgorithm,weexploreseveral linearandnon-linear

classifiers commonly used in this literature: penalized logistic regression (PLR), random

forests(RF),neuralnetworks(NN),gradientboostingtrees(GBT),supportvectormachines

with radial basis functions (SVM), and decision trees C5.0. We implement the first five

algorithmsusingthescikit-learnpackageinPython(Pedregosaetal.,2011)andthesixth

usingtheC50packageinR(Kuhn,Weston,Coulter,&Quinlan,2014).Fortraining,weweight

each featurematrix rowexampleaccording to thenumberof trialsof the corresponding

Page 21: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page19of44©2019byLo,Siah,andWong,AllRightsReserved

drug-indicationpair.Inourearlierexample,thedrug-indicationpairAnaliptin-diabetestype

2was involved in twophase2 trials. It is representedby twovector rows in the feature

matrix(seeFig6).Bothrowsareusedastrainingexamples,andeachisweightedequally

duringtraining(0.5,sincetherearetwotrials in total).Toobtainpredictions foradrug-

indicationpair,weaveragetheoutputprobabilitiesandscoresofthecorrespondingfeature

vectorrowsthatareusedasinputstotheclassifier.

All machine-learning algorithms have hyper-parameters that affect the flexibility of the

modelandmustbetunedtoeachdatasettooptimizegoodnessoffit.Poorly-chosenhyper-

parameterscanleadtooverfitting(attributingsignaltonoise)orunderfitting(attributing

noise to signal).We tuneourparametersusingk-fold cross-validation (withk=5or10,

dependingonthesamplesize).Sincethecross-validationprocessshouldemulatethetesting

processascloselyaspossible,weincludeimputationinthecross-validationloopaswell.We

splitthetrainingsetintovalidationandnon-validationfolds.Thenwetreatvalidationfold

outcomesasmissing,andimputethemaswewouldforatestingset.Fromhere,weignore

theimputedvalidationfoldoutcomesandproceedwiththestandardvalidationprocess.

Inthefinalstep,wetestthetrainedclassifiersontheunseentestingsetsforout-of-sample

modelvalidation.Thisgivestheexpectedperformanceofourpredictivemodelsforeachof

thescenarios,usingthestandardAUCmetrictomeasuremodelperformance.

Page 22: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page20of44©2019byLo,Siah,andWong,AllRightsReserved

3 ResultsSimulationoflistwisedeletionversusimputationWe study the effects of imputation using a “gold-standard” dataset derived from the

completecasesoftheP2APPdataset(seeTable7).Tosimulatethemissingnesspresentin

theoriginaldataset,we introducemissingness in thegold-standarddatasetbasedonour

MARassumptionandthemissingnesspatternsobservedintheP2APPdataset.Werandomly

splitthedrug-indicationpairsintoatrainingset(70%)andatestingset(30%),andusefive

differentmissingdataapproaches,asdescribedinSupplementaryMaterialsC,togenerate

completetrainingsetsfromtheMARtrainingset.Weuseeachimputedtrainingsettobuild

six different predictive models (PLR, RF, NN, GBT, SVM, and C5.0) according to the

methodologyoutlined inSection2.We repeat this experiment100 times for robustness.

Table8summarizestheAUCperformanceoftheclassifiersonthegold-standardtestingsets.

SeeSupplementaryMaterialsEforamoredetaileddescriptionandresults.

Table7.Samplesizeofthegold-standarddataset(derivedfromcompletecasesofP2APP).

Counts Drug-indication

PairsPhase2Trials

UniqueDrugs

UniqueIndications

UniquePhase2Trials

Success 166 341 152 83 337Failure 812 1,672 503 158 1,549Total 978 2,013 623 171 1,872

Page 23: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page21of44©2019byLo,Siah,andWong,AllRightsReserved

Table8.AUCofdifferent classifiersunderdifferentmissingdataapproaches.Abbreviations:Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile;m:numberofimputationsgenerated.

Gold-StandardTestingSetAUC

ImputationMethod Machine-learningModel Avg Sd 5% 50% 95%

Gold-Standard

PLR

0.810 0.028 0.761 0.808 0.853CompleteCases 0.755 0.040 0.683 0.764 0.813Mean/mode 0.778 0.031 0.729 0.779 0.823Median/mode 0.778 0.031 0.728 0.779 0.8245NN 0.786 0.032 0.738 0.787 0.83410NN 0.787 0.032 0.739 0.791 0.835MI(m=1) 0.781 0.036 0.722 0.777 0.843MI(m=10) 0.782 0.031 0.729 0.782 0.831Gold-Standard

RF

0.837 0.027 0.793 0.837 0.876CompleteCases 0.764 0.048 0.685 0.772 0.830Mean/mode 0.775 0.031 0.726 0.771 0.822Median/mode 0.774 0.031 0.723 0.774 0.8275NN 0.805 0.033 0.755 0.805 0.85710NN 0.802 0.033 0.747 0.805 0.856MI(m=1) 0.797 0.033 0.748 0.795 0.853MI(m=10) 0.804 0.030 0.751 0.804 0.848Gold-Standard

NN

0.800 0.032 0.754 0.799 0.849CompleteCases 0.715 0.043 0.638 0.716 0.779Mean/mode 0.790 0.037 0.739 0.789 0.848Median/mode 0.789 0.036 0.740 0.792 0.8495NN 0.794 0.032 0.743 0.798 0.84210NN 0.797 0.036 0.737 0.798 0.851MI(m=1) 0.780 0.036 0.719 0.781 0.838MI(m=10) 0.795 0.030 0.750 0.795 0.838Gold-Standard

GBT

0.820 0.028 0.776 0.821 0.868CompleteCases 0.746 0.050 0.659 0.756 0.816Mean/mode 0.781 0.034 0.724 0.784 0.826Median/mode 0.778 0.033 0.719 0.783 0.8235NN 0.796 0.029 0.737 0.798 0.83710NN 0.796 0.028 0.748 0.798 0.838MI(m=1) 0.796 0.031 0.747 0.796 0.847MI(m=10) 0.804 0.031 0.757 0.803 0.854Gold-Standard

SVM

0.785 0.030 0.730 0.786 0.831CompleteCases 0.733 0.053 0.650 0.741 0.795Mean/mode 0.766 0.036 0.707 0.771 0.818Median/mode 0.764 0.035 0.711 0.771 0.8185NN 0.771 0.034 0.722 0.770 0.82710NN 0.772 0.037 0.710 0.773 0.825MI(m=1) 0.760 0.035 0.696 0.762 0.813MI(m=10) 0.768 0.030 0.719 0.764 0.813Gold-Standard

C5.0

0.800 0.033 0.758 0.800 0.844CompleteCases 0.710 0.063 0.585 0.713 0.802Mean/mode 0.758 0.039 0.698 0.762 0.816Median/mode 0.754 0.043 0.679 0.751 0.8235NN 0.772 0.038 0.715 0.772 0.84310NN 0.770 0.035 0.710 0.771 0.822MI(m=1) 0.758 0.037 0.701 0.754 0.819MI(m=10) 0.807 0.031 0.756 0.808 0.857

Page 24: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page22of44©2019byLo,Siah,andWong,AllRightsReserved

Forallsixmachine-learningalgorithms,wefindthatgold-standardclassifiers—thatis,the

modelsderivedfromcompletedata—consistentlyoutperformtheircomplete-caseanalysis

and imputation counterparts.This is logicalbecauseuseful information is invariably lost

whenwe introducemissingness in thedatasets. Incontrast,complete-caseanalysisoften

leadstoinferiorperformance.TheAUCsofclassifierstrainedoncomplete-casestrainingsets

tendtobesmallerthanthosetrainedonimputedtrainingsets.Thissuggeststhatimputation

doesindeedofferimprovedfitandpredictivepoweroverlistwisedeletion.

Overall,wefindkNNimputationtobemostcompatiblewithourdatasets.Itprovidesthe

leastbiasedimputationsamongallmissingdatamethods(seeSupplementaryMaterialsE).

Inparticular,thecombinationofkNNimputation(k=5)withRFgivesoneofthehighest

gold-standard testing set AUCs (0.81). We note a few other MI combinations that yield

comparable or marginally better performance but focus on the 5NN-RF approach in

subsequentanalysesonthemaindatasetsduetoitseaseofimplementationandapplication.

WefindthatSVMhastheworstperformanceamongallmachine-learningmodels.Thisisnot

surprisingbecauseSVMsareaimedonlyatlearningbinaryclassifiers,anddonotgenerally

producegoodclassprobabilityestimates.Consequently,suchmodelsdonotnecessarilygive

highAUCs.

WealsocompareourapproachwiththeANDIalgorithm(DiMasietal.,2015)byapplyinga

modified version of the index on oncology drugs in the gold-standard testing sets (see

SupplementaryMaterialsHforamorein-depthdescription).Wefindthatour5NN-RFmodel

achievessignificantlyhigherAUCthanthemodifiedANDI,withanaverageimprovementof

0.1inAUCover100simulations(seeFig7).Webelievethatthisgaincanbeattributedtoa

Page 25: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page23of44©2019byLo,Siah,andWong,AllRightsReserved

largertrainingsetwithawiderrangeof features,anonlinearmodel thatcancapturethe

complexrelationshipsinthedata,andapropermodelvalidationmethodology.

Fig7.DistributionsofAUCof5NN-RFandthemodifiedANDIononcology-onlygold-standardtestingsets.

PredictingapprovalsWeanalyzethetwodatasets(P2APPandP3APP)byfirstsplittingeachintoatrainingset

(70%)andatestingset(30%)randomly(pipelinedrug-indicationpairsareomittedsince

theiroutcomeshaveyettobedetermined).Subsequently,wetrain5NN-RFmodelsforeach

scenarioaccordingtothemethodologyoutlinedinSection2.Werepeatthisexperiment100

timesforrobustness.Table9summarizestheAUCperformancemetricsforthetestingsets.

Onaverage,weachieve0.78AUCforP2APPand0.81AUCforP3APP.

Page 26: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page24of44©2019byLo,Siah,andWong,AllRightsReserved

Table 9. Comparison of the general and indication-group specific classifiers for selected indication groups.Abbreviations:Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile.

GeneralClassifier SpecializedClassifiers Avg Sd 5% 50% 95% Avg Sd 5% 50% 95%

P2APPAll 0.777 0.017 0.749 0.775 0.806 - - - - -Anti-cancer 0.805 0.025 0.764 0.805 0.847 0.818 0.029 0.773 0.819 0.865RareDiseases 0.800 0.028 0.756 0.800 0.848 0.775 0.036 0.715 0.777 0.838Neurological 0.767 0.036 0.710 0.769 0.819 0.778 0.039 0.721 0.779 0.834Alimentary 0.749 0.045 0.672 0.751 0.817 0.732 0.048 0.651 0.734 0.807Immunological 0.783 0.065 0.665 0.786 0.889 0.766 0.069 0.646 0.775 0.860Anti-infective 0.735 0.043 0.673 0.736 0.800 0.750 0.047 0.684 0.746 0.832Respiratory 0.756 0.055 0.648 0.764 0.835 0.867 0.043 0.794 0.872 0.921Musculoskeletal 0.822 0.049 0.736 0.821 0.899 0.731 0.076 0.614 0.745 0.849Cardiovascular 0.709 0.072 0.580 0.711 0.812 0.694 0.073 0.579 0.698 0.807Genitourinary 0.633 0.086 0.503 0.634 0.790 0.706 0.091 0.552 0.710 0.840

P3APPAll 0.810 0.018 0.781 0.810 0.834 - - - - -Anti-cancer 0.783 0.047 0.699 0.779 0.853 0.707 0.054 0.612 0.714 0.786RareDiseases 0.819 0.054 0.727 0.822 0.896 0.786 0.058 0.687 0.793 0.875Neurological 0.796 0.037 0.734 0.794 0.857 0.789 0.038 0.741 0.787 0.853Alimentary 0.817 0.047 0.744 0.820 0.891 0.805 0.054 0.718 0.808 0.888Immunological 0.811 0.074 0.680 0.815 0.910 0.757 0.099 0.586 0.765 0.892Anti-infective 0.757 0.065 0.644 0.752 0.854 0.708 0.068 0.600 0.707 0.808Respiratory 0.823 0.065 0.712 0.831 0.920 0.773 0.083 0.627 0.784 0.907Musculoskeletal 0.741 0.095 0.576 0.747 0.866 0.763 0.072 0.646 0.762 0.882Cardiovascular 0.794 0.058 0.702 0.788 0.887 0.755 0.076 0.639 0.765 0.864Genitourinary 0.814 0.083 0.670 0.821 0.937 0.801 0.090 0.635 0.808 0.927

The observed performance is essentially the MAR testing set AUC, since backfilling has

alreadyaffectedthedatasetsused.InSupplementaryMaterialsE,wehighlighttheperilsof

relyingontheMARtestingsetformodelvalidation,andsuggestthattheAUCsforthegold-

standard and MCAR testing sets are more reflective of a classifier’s real performance.

Unfortunately, we have access to neither the gold-standard nor the MCAR testing sets,

becausewedonotknowthetrue,underlyingvaluesofthemissingfeatures.However,our

experiments indicate that the AUCs for the MAR and MCAR testing sets of the 5NN-RF

combinationareveryclose(adifferenceof0.002onaverage).Thismeansthatwemayuse

theformer,theonlyobservedfigure,asareasonableestimateofthelatter,whichreflects

realperformance.

Page 27: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page25of44©2019byLo,Siah,andWong,AllRightsReserved

Next,wetrainclassifiersbasedontheunionofthetrainingandtestingsets,andusethemto

generatepredictionsforpipelinedrug-indicationpairs.WegeneratepredictionsforP2APP

usingonlyinformationfromphase2trialsandforP3APPusingonlyinformationfromphase

3trials.WhilewecannotcomputeAUCscoresforthesesamplesbecausetheiroutcomesare

stillpending,wecancomparetheirpredictionscoreswiththeirdevelopmentstatusatthe

timeof thiswriting.Thesepipelinedrug-indicationpairsmaystillbe in thesameclinical

stage(nochange,i.e.,phase2forP2APP;phase3forP3APP),beterminated(failed),orhave

progressedtohigherphases(advanced).

Fig8,Table10,andTable11summarizethedistributionsofpipelinepredictionscores.We

findthatpairsthatfailgenerallyhavelowerscoresthanthosethatadvancetolaterphases

ofdevelopment.InFig8,weobservepeaksatthelowerendofthescorespectrumforfailed

pairs(red) forbothdatasets. Incontrast,pairs thatadvancetendtohavepeaksathigher

scores(green).Weobservethesamepatternswhenwedisaggregatethedistributionsby

indicationgroups:thegreenpartstendtoclusterabovethedistributionmedianwhilethe

redpartsclusterbelow.However,therearealsosomeindicationgroupsforwhichthereare

toofewsamplestomakeanyusefulremarks(e.g.,hormonalproductsinP2APP).FromTable

10,weseethattheaveragescoresoffailedpairsareindeedlowerthanthosethatadvance

(differencesrangingfrom0.05to0.15).InTable11,webindrug-indicationpairsthathave

newdevelopments (whether failure or advancement) into four groupings, dependingon

theirpredictionscores.Foreachbin,wecomputetheproportionofsamplesthatadvanceto

laterdevelopmentstages.We find that theproportionsgenerally increasewith thescore

magnitude,suggestingthatpairswithhigherscoresaremorelikelytoadvancethanthose

with lower scores.Wenote thatprogress to later clinical stagesdoesnotalways lead to

Page 28: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page26of44©2019byLo,Siah,andWong,AllRightsReserved

approval. However, the results are still promising because advancement is a necessary

condition for approval. Our experiments indicate that our trained classifiers are able to

discriminatebetweenhigh-andlow-potentialcandidates.

Table10.Distributionsofpredictionscoresforallindicationgroupsinaggregate(seeFig8).Advancedreferstoprogresstoahigherphasefromtheoriginalphase.OriginalphaseforP2APPisphase2;forP3APPisphase3.Forinstance,outof1,511drug-indicationpairsintheP2APPtestingset,859pairsarestillpendingdecisioninphase2,244pairshavefailedand408pairshavesuccessfullyadvancedtophase3testing.Abbreviations:n:samplesize;Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile.

PredictionScores n Avg Sd 5% 50% 95%

P2APPAggregate 1,511 0.153 0.061 0.044 0.155 0.258Nochange 859 0.143 0.060 0.041 0.147 0.246Failed 244 0.137 0.061 0.034 0.147 0.240Advanced 408 0.183 0.056 0.093 0.178 0.274

P3APPAggregate 252 0.417 0.189 0.128 0.402 0.695Nochange 142 0.392 0.185 0.129 0.384 0.693Failed 32 0.348 0.185 0.100 0.344 0.656Advanced 78 0.492 0.176 0.233 0.492 0.699

Table11.Distributionsofpredictionscoresforallindicationgroupsinaggregate(seeFig8).Proportionreferstothefractionofsamplesthatadvancedtoalaterphasefromtheoriginalphase.Abbreviations:n:samplesize.

Scores n ProportionP2APP

<0.1 108 0.2310.1-0.2 368 0.6710.2-0.3 171 0.766≥0.3 5 1.000

P3APP<0.2 13 0.3080.2-0.4 35 0.6860.4-0.6 27 0.667≥0.6 35 0.914

Page 29: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page27of44©2019byLo,Siah,andWong,AllRightsReserved

Fig8.DistributionsofpredictionscoresforP2APPandP3APP.Firstrowforallindicationgroupsinaggregate.Subsequentrowsforspecificindicationgroups.

Page 30: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page28of44©2019byLo,Siah,andWong,AllRightsReserved

To gain insight into the logic of our trained predictivemodels,we compute the average

importanceoffeaturesusedinthe5NN-RFclassifiersoveralltheexperiments,andextract

thetoptenmostinformativevariables.TheRFclassifier(Pedregosaetal.,2011)weused

computestheimportanceofavariablebyfindingthedecreaseinnodeimpurityforallnodes

thatsplitonthatvariable,weightedbytheprobabilityofreachingthatnode(asestimated

by the proportion of samples reaching that node), averaged over all trees in the forest

ensemble(Breiman,Friedman,Stone,&Olshen,1984).Table12summarizestheresults.

Table12.Toptenimportantvariablesof5NN-RFclassifiersforP2APPandP3APP.Averageandstandarddeviationtakenacrossallexperiments.Abbreviations:Avg:average;Sd:standarddeviation.

Importance Avg Sd

P2APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.234 0.043Trialstatus 0.160 0.026Medium–solution 0.051 0.018Actualaccrual 0.046 0.010Sponsortype–industry,allotherpharma 0.025 0.008Sponsortrackrecord–numberofpositivephase3trials 0.023 0.006Sponsortrackrecord–numberoffaileddrug-indicationpairs 0.021 0.007Studydesign–placebocontrol 0.019 0.009Targetaccrual 0.018 0.005Priorapprovalofdrugforanotherindication 0.018 0.007

P3APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.357 0.028Trialstatus 0.148 0.014Duration 0.099 0.016Trialoutcome–terminated,lackofefficacy 0.033 0.010Trialoutcome–completed,negativeoutcome,orprimaryendpoint(s)notmet 0.033 0.008Therapeuticarea–oncology 0.030 0.009Priorapprovalofdrugforanotherindication 0.021 0.007Actualaccrual 0.015 0.003Medium–powder 0.014 0.007Medium–solution 0.012 0.006

Wefindthattrialoutcome(whetherthetrialwascompletedwithitsprimaryendpointsmet)

and trial status (whether the trial was completed or terminated) have significant

Page 31: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page29of44©2019byLo,Siah,andWong,AllRightsReserved

associationswithsuccess.Thesetwofeatureswereconsistentlyrankedthetoptwooutof

allvariablesandacrossbothdatasets.Itiseasytoimaginethatadrug-indicationpairwhose

trialswereterminatedhasalowprobabilityofsuccessintermsofadvancingfromphases

2/3 to approval. In contrast, candidates that achieve positive outcomes certainly have a

bettershotatsuccess.Wealsoobservethatpriorapprovalofadrughasaneffectonsuccess

fornewindicationsorpatientsegmentation.Itisplausiblethatdevelopinganapproveddrug

foranewindicationhasagreaterlikelihoodofsuccessthananewcandidate.

Inaddition,trialcharacteristicssuchasaccrual,duration,andthenumberofidentifiedsites

frequentlyappearinthetoptenimportantvariables.Thereareseveralpossibleexplanations.

Forexample,trialsthatendquicklywithoutachievingprimaryendpointsmayundermine

the likelihood of success, and drugs with trials that have small accrual—and thus low

statisticalpower—mayhavealowerprobabilityofbeingapproved.

Wealsofindsponsortrackrecords—quantifiedbythenumberofpastsuccessfultrials(trials

thatachievepositiveresultsormeetprimaryendpoints)—tobeausefulfactorforprediction.

This factor has not been considered in previous related studies, but the intuition for its

predictivepowerisclear:strongtrackrecordsarelikelyassociatedwithgreaterexpertisein

drugdevelopment.

Sincedrugsdevelopedfordifferentindicationgroupsmayhaveverydifferentcharacteristics,

wemightexpectclassifierstrainedonindication-group-specificdatatooutperformgeneral

classifiers.We build and analyze such specialized classifiers by filtering the datasets by

indicationgroupbeforeperformingtheexperimentdescribedintheprevioussection.Asa

comparison,wealsobreakdown theperformanceof thegeneral classifiersby indication

Page 32: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page30of44©2019byLo,Siah,andWong,AllRightsReserved

group. Table 9 shows the results for selected indication groups. In general, we find

specializedmodelstogivepoorerperformancethangeneralmodels.Thisislikelybecause

theformeraretrainedonlessdata,whichmakesthemlessaccurateandmoresusceptibleto

overfitting.

We note that the approach adopted in this section—splitting drug-indication pairs into

trainingandtestingsetsrandomlywithoutconsideringthedatesofdevelopment—maybe

less than idealbecauseof look-aheadbias.Forexample, if the resultsof a2008 trial are

included inthetrainingset forpredictingtheoutcomeofa2004developmentpath fora

drug-indicationpair,ourmodelwillbeusingfutureinformationduringvalidation,whichcan

yieldmisleadingand impractical inferences.Toaddressthis issue, in thenextsectionwe

apply our machine-learning framework to time-series data using rolling windows that

accountfortemporalorderingintheconstructionoftrainingandtestingsets.Althoughthis

processmakesuseoflessdatawithineachestimationwindowthanwhentheentiredataset

isused,itminimizestheimpactoflook-aheadbiasandyieldsmorerealisticinferences.We

studytheeffectsofrandomsplittingversustemporalorderinginSupplementaryMaterials

J.

PredictionsovertimeDrugdevelopmenthaschangedsubstantiallyovertime,thankstonewscientificdiscoveries

and technological improvements.To reflect these changes inourpredictiveanalytics,we

adoptatimeseries,walk-forwardapproachtocreatetrainingandtestingsetsforeachofthe

twodatasets,P2APPandP3APP(seeFig9).Wesamplefive-yearrollingwindowsbetween

2004and2014fromeachdataset.Eachwindowconsistsofatrainingsetofdrug-indication

pairswhoseoutcomesbecomefinalizedwithinthewindow,andanout-of-sample,out-of-

Page 33: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page31of44©2019byLo,Siah,andWong,AllRightsReserved

timetestingsetofdrug-indicationpairsthatendedphase2orphase3testing,butarestillin

thepipelinewithundeterminedoutcomeswithin thewindow.Forexample, consider the

P2APPdataset.Wedrawthe firstwindowfrom2004–2008, trainouralgorithmondrug-

indicationpairsthatfailedorwereapprovedwithinthisperiodasthetrainingset,andapply

thetrainedmodeltopredicttheoutcomesofdrug-indicationsthatjustendedphase2testing

withinthesamewindowasthetestingset.

Fig9.Time-serieswalk-forwardanalysisapproach.Thetestingset inthelastwindow(green)comprisesdrug-indicationpairsinthepipelineatthetimeofsnapshotofthedatabases.

We evaluate the resulting classifier by comparing its predictionswith outcomes that are

realized in the future (2009–2015). This rolling-window approach yields a total of eight

overlapping trainingand testingperiodswhereanew5NN-RFmodel is trained foreach

period.Theeighthtestingperiodconsistsofdrug-indicationpairsinthepipelineatthetime

ofsnapshotofthedatabases.Unlikethefirstsevenperiods,theiroutcomesarestillpending

current development, and thereforewe cannot compute a testing AUC for this window.

However,wecanexaminethepredictionsandcomparethescoreswiththeirdevelopment

statusesatthetimeofthiswriting.

Page 34: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page32of44©2019byLo,Siah,andWong,AllRightsReserved

Fig10summarizestheresultsofthetime-seriesanalysisforthefirstsevenwindows.We

observeanincreasingtrendovertheyearsforbothP2APP(0.67inthefirstand0.80inthe

lastwindow)andP3APP(0.77inthefirstand0.88inthelastwindow).Interestingly,wenote

thattheproportionsofcompletecasesinthetrainingsetscorrelatewellwiththetimeseries

AUC(correlationcoefficient0.95forP2APPand0.90forP3APP).Wecomputetheproportion

ofcompletecasesbytakingthenumberoffeaturevectorrowswithcompleteinformation

over the total number of rows. As is apparent from Fig 10, the proportions have been

increasing over the years for both datasets. This is likely due to better data reporting

practicesbydrugdevelopers,apossibleconsequenceofFDAAA.

Fig10.Time-serieswalk-forwardanalysisforP2APPandP3APPusing5NN-RF.Weusebootstrappingtodeterminethe95%CIforAUC(dottedlines).Thedashedlinesplotthecorrespondingproportionsofcompletecasesinthetrainingsetsofeachfive-yearwindow.Abbreviations:CC:proportionofcompletecases.

Page 35: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page33of44©2019byLo,Siah,andWong,AllRightsReserved

Next,weexaminethe2011–2015window.Fig11,Table13,andTable14summarizethe

distributions of prediction scores for the P2APP and P3APP datasets. We observe very

similarpatternstothestaticpipelinepredictionsabove.Thehistograms,averagescores,and

binningofsamples indicatethatpairs that fail tendtohave lowerpredictionscoresthan

thosethatadvance.Thisshowsthatourclassifiersareindeedabletodifferentiatesuccessful

candidates.

Table13.Distributionsofpredictionscoresforallindicationgroupsinaggregate(seeFig11).Advancedreferstoprogresstoahigherphasefromtheoriginalphase.OriginalphaseforP2APPisphase2; forP3APPisphase3.Abbreviations:n:samplesize;Avg:average;Sd:standarddeviation;5%:5thpercentile;50%:median;95%:95thpercentile.

PredictionScores n Avg Sd 5% 50% 95%

P2APPAggregate 1,190 0.158 0.080 0.036 0.173 0.290Nochange 712 0.148 0.080 0.035 0.158 0.275Failed 195 0.143 0.079 0.034 0.149 0.255Advanced 283 0.197 0.071 0.068 0.200 0.323

P3APPAggregate 218 0.431 0.211 0.113 0.476 0.689Nochange 121 0.395 0.207 0.113 0.403 0.684Failed 28 0.362 0.211 0.093 0.335 0.640Advanced 69 0.521 0.193 0.149 0.631 0.707

Table14.Distributionofpredictionscoresforallindicationgroupsinaggregate(seeFig11).Proportionreferstothefractionofsamplesthatadvancedtoahigherphasefromtheoriginalphase.Abbreviations:n:samplesize.

Scores n ProportionP2APP

<0.1 99 0.3130.1-0.2 183 0.6070.2-0.3 168 0.690≥0.3 28 0.893

P3APP<0.2 17 0.4120.2-0.4 17 0.7060.4-0.6 17 0.647≥0.6 46 0.848

Page 36: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page34of44©2019byLo,Siah,andWong,AllRightsReserved

Page 37: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page35of44©2019byLo,Siah,andWong,AllRightsReserved

Fig11.Distributionsofpredictionscoresofthe2011–2015windowtestingsetforP2APPandP3APP.Firstrowforallindicationgroupsinaggregate.Subsequentrowsforspecificindicationgroups.

Page 38: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page36of44©2019byLo,Siah,andWong,AllRightsReserved

Table15summarizesthetoptenmostinformativevariablesinthe5NN-RFclassifiersover

theeightrollingwindows.Wefindthemtobelargelyconsistentwiththoseobservedinthe

staticcase:thetrialoutcomeandtrialstatusaresignificantlyassociatedwithsuccess;trial

characteristics (such as accrual, duration, and number of identified sites), sponsor track

record,anddrugmediumappearfrequentlyinbothscenarios.

Table15.Toptenimportantvariablesin5NN-RFclassifiersforP2APPandP3APP.Averageandstandarddeviationtakenacrosstheeightrollingwindows.Abbreviations:Avg:average;Sd:standarddeviation.

Importance Avg Sd

P2APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.203 0.083Trialstatus 0.102 0.033Priorapprovalofdrugforanotherindication 0.077 0.061Actualaccrual 0.039 0.015Targetaccrual 0.031 0.010Duration 0.027 0.014Sponsortrackrecord–numberofcompletedphase3trials 0.025 0.007Medium–suspension 0.024 0.018Sponsortype–academic 0.023 0.017Medium–solution 0.021 0.019

P3APPTrialoutcome–completed,positiveoutcome,orprimaryendpoint(s)met 0.348 0.028Trialstatus 0.125 0.020Duration 0.053 0.017Priorapprovalofdrugforanotherindication 0.046 0.028Trialoutcome–completed,negativeoutcome,orprimaryendpoint(s)notmet 0.033 0.026Targetaccrual 0.021 0.005Trialoutcome–terminated,lackofefficacy 0.020 0.013Actualaccrual 0.019 0.004Therapeuticarea–oncology 0.017 0.013Numberofidentifiedsites 0.012 0.002

Asinthestaticcase,wealsotrainindication-groupspecificclassifiersusingrollingwindows.

Table16andTable17summarizetheresultsforselectedindicationgroupsinP2APPand

P3APP, respectively (see Supplementary Materials G for results of all other indication

groups). Indication groups with small sample sizes tend to produce poor and unstable

Page 39: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page37of44©2019byLo,Siah,andWong,AllRightsReserved

specializedclassifiers(e.g.,themusculoskeletalindicationgroupinP2APP).Thisisexpected

becausemodelstrainedonsmalltrainingsetsaremoresusceptibletooverfitting,especially

whennon-linearalgorithmssuchasRFareused.Incontrast,indicationgroupswithlarger

samplesizestendtogiverisetorathergoodclassifiers(e.g.,anti-cancerinP2APP).

Table16.Comparisonof thegeneraland indication-groupspecific classifiers for selected indicationgroups inP2APP.Weusebootstrappingtodeterminethe95%CIforAUC.

GeneralClassifier SpecializedClassifiers TrainSet TestSet AUC[95%CI] TrainSet TestSet AUC[95%CI]

All2004–2008 1,361 551 0.669[0.614,0.725] - - -2005–2009 1,562 591 0.680[0.625,0.735] - - -2006–2010 1,764 636 0.712[0.668,0.755] - - -2007–2011 1,969 598 0.738[0.698,0.777] - - -2008–2012 2,082 597 0.799[0.760,0.837] - - -2009–2013 2,212 517 0.823[0.779,0.867] - - -2010–2014 2,289 380 0.797[0.718,0.876] - - -

Anti-cancer2004–2008 1,361 137 0.665[0.528,0.803] 456 137 0.683[0.533,0.833]2005–2009 1,562 163 0.739[0.618,0.861] 494 163 0.635[0.512,0.758]2006–2010 1,764 188 0.774[0.702,0.846] 546 188 0.726[0.635,0.816]2007–2011 1,969 193 0.830[0.773,0.887] 618 193 0.746[0.661,0.831]2008–2012 2,082 198 0.805[0.717,0.894] 682 198 0.760[0.665,0.855]2009–2013 2,212 177 0.852[0.783,0.922] 736 177 0.786[0.696,0.876]2010–2014 2,289 173 0.815[0.691,0.938] 791 173 0.803[0.666,0.940]

Musculoskeletal2004–2008 1,361 35 0.765[0.597,0.933] 96 35 0.704[0.512,0.896]2005–2009 1,562 38 0.716[0.489,0.944] 109 38 0.674[0.472,0.876]2006–2010 1,764 35 0.634[0.439,0.830] 111 35 0.509[0.276,0.742]2007–2011 1,969 37 0.737[0.571,0.903] 119 37 0.677[0.493,0.860]2008–2012 2,082 36 0.884[0.773,0.995] 127 36 0.683[0.462,0.904]2009–2013 2,212 26 0.792[0.573,1.000] 133 26 0.667[0.429,0.904]2010–2014 2,289 19 0.882[0.724,1.000] 128 19 0.882[0.706,1.000]

Page 40: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page38of44©2019byLo,Siah,andWong,AllRightsReserved

Table17.Comparisonof thegeneraland indication-groupspecific classifiers for selected indicationgroups inP3APP.Weusebootstrappingtodeterminethe95%CIforAUC.

GeneralClassifier SpecializedClassifiers TrainSet TestSet AUC[95%CI] TrainSet TestSet AUC[95%CI]

All2004–2008 472 196 0.769[0.704,0.834] - - -2005–2009 559 177 0.724[0.650,0.798] - - -2006–2010 604 211 0.738[0.671,0.805] - - -2007–2011 664 174 0.806[0.740,0.871] - - -2008–2012 677 197 0.827[0.768,0.886] - - -2009–2013 740 153 0.868[0.809,0.927] - - -2010–2014 734 110 0.876[0.811,0.941] - - -

Anti-cancer2004–2008 472 34 0.773[0.618,0.928] 95 34 0.684[0.495,0.874]2005–2009 559 28 0.740[0.543,0.936] 107 28 0.568[0.345,0.791]2006–2010 604 50 0.754[0.599,0.910] 110 50 0.630[0.452,0.809]2007–2011 664 24 0.587[0.333,0.842] 132 24 0.392[0.132,0.651]2008–2012 677 40 0.793[0.549,1.000] 134 40 0.668[0.457,0.879]2009–2013 740 29 0.800[0.480,1.000] 151 29 0.775[0.528,1.000]2010–2014 734 26 0.943[0.842,1.000] 153 26 0.852[0.558,1.000]

RareDiseases2004–2008 472 22 0.711[0.465,0.957] 54 22 0.620[0.364,0.876]2005–2009 559 23 0.735[0.517,0.952] 60 23 0.606[0.360,0.852]2006–2010 604 24 0.888[0.747,1.000] 66 24 0.825[0.645,1.000]2007–2011 664 22 0.838[0.652,1.000] 72 22 0.735[0.520,0.950]2008–2012 677 34 0.893[0.780,1.000] 76 34 0.700[0.523,0.877]2009–2013 740 28 0.962[0.899,1.000] 94 28 0.932[0.840,1.000]2010–2014 734 18 0.908[0.766,1.000] 109 18 0.985[0.942,1.000]

For comparison, we disaggregate performance by indication group. We find that these

classifiersdonotloseouttotheirspecializedcounterparts.Infact,ourresultsshowthatthe

formertendtoexhibitmorestableperformanceacrossthesevenwindows,particularlyon

indicationgroupswithsmallsamplesizes.Wehypothesizethatclassifierstrainedonalldata

benefit fromhavingaccess to largerdatasetswithgreaterdiversity, andare thusable to

makemoreinformedpredictions.Thissuggeststhatitmaybemoreappropriatetorelyon

generalclassifiers,ratherthanspecializedones,forpredictionsovertimewheresamplesare

spreadoutovermultiplewindows,sincefurtherfilteringbyindicationgroupresultsineven

smallersamplesizes.

Page 41: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page39of44©2019byLo,Siah,andWong,AllRightsReserved

Finally,weextractthetopfiveP2APPpipelinedrugcandidateswiththehighestscoresin

each indication group as predicted by the 2011–2015 rolling-window model. Table 18

summarizestheresults.Weincludeonlycandidatesthatarestilloutstandingatthetimeof

writing(neitherdiscontinuednorapproved).Itisencouragingthatmanyofthesecandidates

(indicatedinitalics)haveadvancedbeyondphase2testingsinceouranalysis,indicatingthe

predictivepowerofourmodels.WeincludeaninteractiveversioninFig12wherereaders

canfilterourpipelinepredictionsbyindicationgroupandprobabilityofapproval.Ultimately,

allbiopharmastakeholderscanusesuchscorestorankandevaluatethepotentialrisksand

rewardsofdrugcandidates.

Fig12.NetworkgraphofP2APPpipelinedrugcandidates.Blacknodescorrespondtoindicationgroups.Colorednodescorrespondtodrug-indicationpairs.Eachdrug-indicationpairnodeisconnectedtoitsparentindicationgroupandalsootherdrug-indicationpairs thathave thesame indication.Theyarecoloredaccordingto theirrespectiveprobabilityofapprovalaspredictedbyourmodel—blueforhigherscoresandredforlowerscores.Hoverovernodesfordetailsofeachdrug-indicationpair.Blackindicationgroupnodesaresizedbasedonthenumberofconnections.

Theinformationpresentedthroughthisresearchandintheincludedfigurearemadeavailablesolelyforgeneralinformationalpurposes.Theauthorsdonotwarranttheaccuracy,completenessorusefulnessofthisinformation.Anyrelianceyouplaceonsuchinformationisstrictlyatyourownrisk.Theauthorsexpresslydisclaimallliabilityandresponsibilityarisingfromanyrelianceplacedonsuchinformationbyyou,orbyanyonewhomaybeinformedofthisinformation.

Page 42: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page40of44©2019byLo,Siah,andWong,AllRightsReserved

Table18.TopfiveP2APPpipelinedrugcandidateswiththehighestscoresineachindicationgroupaspredictedbyourmodel.Weincludeonlycandidatesthatarestilloutstandingatthetimeofwriting(neitherdiscontinuednorapproved).Drug-indicationpairs in italicsare those thathaveadvancedbeyondphase2 testingsinceouranalysis.

Drug Indication Score Drug Indication ScoreAnti-cancer Musculoskeletalontecizumab Cancer,colorectal 0.34 tofacitinib Arthritis,psoriatic 0.31

calmangafodipirRadio/chemotherapy-inducedinjury,bonemarrow,neutropenia

0.31 ixekizumab Arthritis,rheumatoid 0.31

tivantinib Cancer,sarcoma,softtissue 0.30 anti-BLyS/APRILantibodyfusionprotein Arthritis,rheumatoid 0.31

pidilizumab Cancer,colorectal 0.29 sirukumab Arthritis,rheumatoid 0.29NK-012 Cancer,colorectal 0.28 romosozumab Osteoporosis 0.28RareDiseases Cardiovascularsurotomycin Infection,Clostridiumdifficile 0.34 K-134 Peripheralvasculardisease 0.37tivantinib Cancer,sarcoma,softtissue 0.30 nitricoxide,inhaled Hypertension,pulmonary 0.29

VP-20621 Infection,Clostridiumdifficileprophylaxis 0.30 TY-51924 Infarction,myocardial 0.28

KHK-7580 Secondaryhyperparathyroidism 0.29 s-amlodipine+telmisartan Hypertension,unspecified 0.27nitricoxide,inhaled Hypertension,pulmonary 0.29 tirasemtiv Peripheralvasculardisease 0.24Neurological Dermatological

dasotraline Attentiondeficithyperactivitydisorder 0.35 tofacitinib Arthritis,psoriatic 0.31

idalopirdine Alzheimer'sdisease 0.35 dimethylfumarate Psoriasis 0.27GRC-17536 Neuropathy,diabetic 0.34 pefcalcitol Psoriasis 0.24caprylictriglyceride Alzheimer'sdisease 0.32 Benvitimod Psoriasis 0.22

levodopa Parkinson'sdisease 0.31 calcipotriolmonohydrate+betamethasonedipropionate

Psoriasis 0.22

Alimentary Genitourinary

ibodutant Irritablebowelsyndrome,diarrhoea-predominant 0.37

etonogestrel+estradiol(vaginalring),nextgeneration

Contraceptive,female 0.30

GRC-17536 Neuropathy,diabetic 0.34 drospirenone+estradiol Contraceptive,female 0.28mesalazine+N-acetylcysteine Colitis,ulcerative 0.31 finerenone Nephropathy,diabetic 0.27

apabetalone(tablet) Diabetes,Type2 0.31 afacifenacinfumarate Overactivebladder 0.26phosphatidylcholine Colitis,ulcerative 0.31 GKT-137831 Nephropathy,diabetic 0.26Immunological BloodandClotting

tofacitinib Arthritis,psoriatic 0.31 calmangafodipirRadio/chemotherapy-inducedinjury,bonemarrow,neutropenia

0.31

ixekizumab Arthritis,rheumatoid 0.31 balugrastim Radio/chemotherapy-inducedinjury,bonemarrow,neutropenia 0.27

anti-BLyS/APRILantibodyfusionprotein Arthritis,rheumatoid 0.31 eflapegrastim Radio/chemotherapy-induced

injury,bonemarrow,neutropenia 0.25

sirukumab Arthritis,rheumatoid 0.29 pegfilgrastim Radio/chemotherapy-inducedinjury,bonemarrow,neutropenia 0.22

dimethylfumarate Psoriasis 0.27 lexaptepidpegol Radio/chemotherapy-inducedanaemia 0.20

Anti-infective Sensory

delafloxacin Infection,skinandskinstructure,acutebacterial 0.39 AR-13324+latanoprost Glaucoma 0.27

surotomycin Infection,Clostridiumdifficile 0.34 S-646240 Maculardegeneration,age-related,wet 0.27

delafloxacin Infection,pneumonia,community-acquired 0.33 netarsudil Glaucoma 0.26

plazomicin Infection,urinarytract,complicated 0.33 fenofibrate,micronized-2 Oedema,macular,diabetic 0.25

Ypeginterferonalpha-2b Infection,hepatitis-Cvirus 0.33 LX-7101 Glaucoma 0.21Respiratory Hormonalfluticasone+salmeterol Asthma 0.36 KHK-7580 Secondaryhyperparathyroidism 0.29fluticasonefuroate+umeclidinium+vilanterol

Chronicobstructivepulmonarydisease 0.36 somatropinprodrug,

pegylated Growthhormonedeficiency 0.21

fluticasonefuroate+umeclidinium

Chronicobstructivepulmonarydisease 0.36 2MD Secondaryhyperparathyroidism 0.21

beclometasone+formoterol

Chronicobstructivepulmonarydisease 0.35 velcalcetide Secondaryhyperparathyroidism 0.19

fluticasonepropionateDPI Asthma 0.35 tesamorelinacetate Growthhormonedeficiency 0.18

Page 43: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page41of44©2019byLo,Siah,andWong,AllRightsReserved

4 DiscussionDrugdevelopmentisanextremelycostlyprocessandtheaccurateevaluationofacandidate

drug’s likelihood of approval is critical to the efficient allocation of capital. Historical

successes and failures contain valuable insights on the characteristics of high-potential

candidates. Unfortunately, such data are often incomplete due to partial reporting by

investigatorsanddevelopers.Mostanalyticmethodsrequirecompletedata,however,and

priorstudiesonestimatingapprovalratesandpredictingapprovalsaretypicallybasedona

smallnumberofexamplesthathavecompleteinformationforjustafewfeatures.

In thispaper,we extract twodatasets, P2APP andP3APP, from Informa®databases and

apply5NNstatisticalimputationtomakeefficientuseofallavailabledata.Weusemachine-

learningtechniquestotrainandvalidateourRFpredictivemodelsandachievepromising

levelsofpredictivepowerforbothdatasets.Whenappliedtopipelinedrugs,wefindthat

candidateswithhigherscoresareindeedmorelikelytoadvancetohigherclinicalphases,

indicating that our 5NN-RF classifiers are able to discriminate between high- and low-

potentialcandidates.

Atime-seriesanalysisofthedatasetsshowsgenerallyincreasingtrendsinperformanceover

five-year rolling windows from 2004 to 2014.We find that the classifiers’ performance

correlateswellwiththeproportionsofcompletecasesinthetrainingsets:ascompleteness

increases, the classifier learns better and achieves higher AUCs. This highlights the

importance of data quality in building more accurate predictive algorithms for drug

development.

Page 44: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page42of44©2019byLo,Siah,andWong,AllRightsReserved

Finally,wecomputefeatureimportanceinthepredictivemodelsandfindthattrialoutcomes,

trialstatus,trialaccrualrates,duration,priorapprovalforanotherindication,andsponsor

track records are the most critical features for predicting success. Because the 5NN-RF

classifiersarenon-linear,thereisnosimpleinterpretationoftheincrementalcontribution

ofeachpredictortotheforecast.However,theintuitionbehindsomeofthesefactorsisclear:

drug-indication pairs with trials that achieve positive outcomes certainly have a better

chance of approval; candidates sponsored by companies with strong track records and

greater expertise in drug development should have higher likelihood of success; and

approveddrugsmayhavehigherchancesofapprovalforasecondrelatedindication.Many

ofthesefactorscontainusefulsignalsaboutdrugdevelopmentoutcomesbuthavenotbeen

consideredinpriorstudies.

These results are promising and raise the possibility of even more powerful drug

developmentpredictionmodelswithaccess tobetterqualitydata.This canbedrivenby

programssuchasProjectDataSphere(Greenetal.,2015)andVivli(Bierer,Li,Barnes,&Sim,

2016)thatpromoteandfacilitatepublicsharingofpatient-levelclinicaltrialdata.Ultimately,

suchpredictiveanalyticscanbeusedtomakemoreinformeddata-drivendecisionsinthe

riskassessmentandportfoliomanagementofinvestigationaldrugsatallclinicalstages.

Page 45: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page43of44©2019byLo,Siah,andWong,AllRightsReserved

ReferencesBierer,B.E.,Li,R.,Barnes,M.,&Sim,I.(2016).Aglobal,neutralplatformforsharingtrialdata.New

EnglandJournalofMedicine,374(25),2411-2413.

Breiman,L.,Friedman,J.,Stone,C.J.,&Olshen,R.A.(1984).Classificationandregressiontrees.CRCPress.

Collins,L.M.,Schafer,J.L.,&Kam,C.M.(2001).Acomparisonofinclusiveandrestrictivestrategiesinmodernmissingdataprocedures.PsychologicalMethods,6(4),330.

DiMasi,J.A.,Hermann,J.C.,Twyman,K.,Kondru,R.K.,Stergiopoulos,S.,Getz,K.A.,&Rackoff,W.(2015). A tool for predicting regulatory approval after phase II testing of new oncologycompounds.ClinicalPharmacology&Therapeutics,98(5),506-513.

El-Maraghi, R.H., & Eisenhauer, E. A. (2008).Review of phase II trial designs used in studies ofmolecular targetedagents:outcomesandpredictorsofsuccess inphase III. JournalofClinicalOncology,26(8),1346-1354.

Enders,C.K.(2010).Appliedmissingdataanalysis.GuilfordPress.

Fawcett,T.(2006).AnintroductiontoROCanalysis.PatternRecognitionletters,27(8),861-874.

Fernandez, J. M., Stein, R. M., & Lo, A.W. (2012). Commercializing biomedical research throughsecuritizationtechniques.NatureBiotechnology,30(10),964-975.

Goffin,J.,Baral,S.,Tu,D.,Nomikos,D.,&Seymour,L.(2005).Objectiveresponsesinpatientswithmalignant melanoma or renal cell cancer in early clinical studies do not predict regulatoryapproval.ClinicalCancerResearch,11(16),5928-5934.

Green,A.K.,Reeder-Hayes,K.E.,Corty,R.W.,Basch,E.,Milowsky,M.I.,Dusetzina,S.B.,Bennett,A.V.,&Wood,W.A.(2015).Theprojectdatasphereinitiative:acceleratingcancerresearchbysharingdata.TheOncologist,20(5),464-e20.

Informa-PharmaceuticalClinicalTrialIntelligenceProducts.(2016).Informa.Retrieved5December2016, from https://pharmaintelligence.informa.com/products-and-services/data-and-analysis/citeline-joins-informas-pharma-intelligence

Jardim,D.L.,Groves,E.S.,Breitfeld,P.P.,&Kurzrock,R.(2017).Factorsassociatedwithfailureofoncology drugs in late-stage clinical development: A systematic review. Cancer TreatmentReviews,52,12-21.

Kuhn,M.,Weston,S.,Coulter,N.,&Quinlan,R.(2014).C50:C5.0decisiontreesandrule-basedmodels.Rpackageversion0.1.0-21.

Malik,L.,Mejia,A.,Parsons,H.,Ehler,B.,Mahalingam,D.,Brenner,A.,SarantopoulosJ.,&Weitman,S.(2014).PredictingsuccessinregulatoryapprovalfromPhaseIresults.CancerChemotherapyandPharmacology,74(5),1099-1103.

Pedregosa,F.,Varoquaux,G.,Gramfort,A.,Michel,V.,Thirion,B.,Grisel,O.,BlondelM.,PrettenhoferP.,WeissR.,DubourgV.,&Vanderplas,J.(2011).Scikit-learn:MachinelearninginPython.JournalofMachineLearningResearch,12(Oct),2825-2830.

Rubin,D.B.(1976).Inferenceandmissingdata.Biometrika,63(3),581-592.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American StatisticalAssociation,91(434),473-489.

Page 46: Machine learning with statistical imputation for ......approval dataset includes more than 6,000 unique drugs for 288 indications and over 14,500 phase 2 trials, and the phase-3-to-approval

HDSR

Issue1

19May2019 MachineLearningPredictionsofDrugApprovals Page44of44©2019byLo,Siah,andWong,AllRightsReserved

Scannell, J., Blanckley, A., Boldon, H. & Warrington, B. (2012). Diagnosing the decline inpharmaceuticalR&Defficiency.NatureReviewsDrugDiscovery11,191–200.

Schafer,J.L.(1997).Analysisofincompletemultivariatedata.CRCPress.

Schafer, J.L.,&Graham, J.W. (2002).Missingdata:ourviewof thestateof theart.PsychologicalMethods,7(2),147-177.

VanBuuren,S.(2012).Flexibleimputationofmissingdata.CRCPress.

Zarin,D.A.,Tse,T.,Williams,R.J.,&Carr,S.(2016).Trialreportinginclinicaltrials.gov—thefinalrule.NewEnglandJournalofMedicine,375(20),1998-2004.