past is prologue: limitations of statistical prediction ... · 11/22/2019 · derek j smolenski,...

Past Is Prologue: Limitations of Statistical Prediction Persist in Predictive Modeling

“Medically Ready Force…Ready Medical Force”

Derek J Smolenski, PhD, MPH

Epidemiologist, Psychological Health Center of Excellence, Defense Health Agency

UNCLASSIFIED

Disclosure

Derek J Smolenski has nothing to disclose.

Disclosure will be made when a product is discussed for an unapproved use.

The views expressed in this presentation are those of the presenters and do not necessarily reflect the official policy or position of the Department of Defense (DoD) or the U.S. Government.

This continuing education activity is managed and accredited by AffinityCE in collaboration with AMSUS. AffinityCE and AMSUS staff as well as Planners and Reviewers, have no relevant financial or non-financial interests to disclose.

Commercial support was not received for this activity.


Objectives

Participants will be able to:

discriminate between sensitivity and positive predictive value.

identify two strategies to improve the positive predictive performance of a predictive algorithm.

explain why the low prevalence of a target outcome is detrimental to the positive predictive value.


Overview

∎ Introduction

∎ Historical perspectives

∎ Key concepts

∎ Overview of literature

∎ Simulation findings

∎ Clinical utility

∎ Summary


Introduction

∎ Death by suicide is a concern for both the US general population and the military population

∎ Rates for both groups have shown increases over time (DoDSER, 2017)

∎ Statistical models proposed to improve potential case identification

∎ Unclear how useful these models will be in practical application

∎ Recently reviewed by Belsher et al. (2019)


Historical Perspectives

∎ “Using empirically derived schedules to predict suicide with any clinical certainty is unlikely” (MacKinnon & Farberow, 1976; p. 91)

An instrument that has a 17-20% positive predictive accuracy could be useful

∎ Low base rate and instrumentation issues (Pokorny, 1983)

∎ Accurate assessment and clinical utility differ – for violence prediction, insufficiently accurate to sort individuals into substantively distinct risk groups (Mossman, 2000)

∎ Inaccuracies in actuarial and clinical risk assessment, and lack of evidence of meaningful clinical intervention (Undrill, 2007)


Key Concepts

∎ Accuracy: (𝑎 + 𝑑)/𝑁

∎ Sensitivity (Se; Recall): 𝑎/(𝑁𝑝)

∎ Specificity (Sp): 𝑑/[𝑁(1 − 𝑝)]

∎ Positive predictive value (PPV; Precision): 𝑎/(𝑎 + 𝑏)

            Suicide    No Suicide    Total
Positive    a          b             a+b
Negative    c          d             c+d
Total       Np         N(1-p)        N
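The four definitions above can be checked with a short sketch. The class name `TwoByTwo` and the cell counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from the presentation.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Cells of the 2x2 table, using the slide's a, b, c, d labels."""
    a: float  # test positive, outcome present (true positives)
    b: float  # test positive, outcome absent (false positives)
    c: float  # test negative, outcome present (false negatives)
    d: float  # test negative, outcome absent (true negatives)

    @property
    def n(self) -> float:
        return self.a + self.b + self.c + self.d

    def accuracy(self) -> float:
        return (self.a + self.d) / self.n

    def sensitivity(self) -> float:
        # a / Np, where Np = a + c is the number with the outcome
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        # d / [N(1 - p)], where N(1 - p) = b + d is the number without it
        return self.d / (self.b + self.d)

    def ppv(self) -> float:
        return self.a / (self.a + self.b)


# Hypothetical counts for illustration only.
table = TwoByTwo(a=30, b=70, c=20, d=880)
print(f"accuracy={table.accuracy():.3f}  Se={table.sensitivity():.3f}  "
      f"Sp={table.specificity():.3f}  PPV={table.ppv():.3f}")
```

In this made-up table, accuracy is high mainly because true negatives dominate, while PPV is much lower than sensitivity.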

Key Concepts

∎ Sensitivity and specificity depend on classification threshold

Tend to be stable across populations

∎ Predictive values heavily influenced by population prevalence in addition to classification thresholds
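A minimal sketch of why prevalence matters: with sensitivity and specificity held fixed, PPV follows from Bayes' theorem. The function name `ppv` and the prevalence values are illustrative assumptions; the Se = 50%, Sp = 95% pair is one of the combinations plotted in the figures that follow.

```python
def ppv(se: float, sp: float, prevalence: float) -> float:
    """Positive predictive value from Se, Sp, and prevalence (all as proportions)."""
    true_pos = se * prevalence                    # proportion of population: true positives
    false_pos = (1.0 - sp) * (1.0 - prevalence)   # proportion of population: false positives
    return true_pos / (true_pos + false_pos)


# Same test characteristics, different (illustrative) prevalence values.
for prev in (0.0002, 0.01, 0.10, 0.50):
    print(f"prevalence={prev:.4f}  PPV={ppv(0.50, 0.95, prev):.4f}")
```

As prevalence falls, false positives from the large unaffected group swamp the true positives, so PPV drops even though Se and Sp are unchanged.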


Positive Predictive Value

[Figure: PPV (%) plotted against prevalence (0 to 100%) for Se=30, Sp=99; Se=50, Sp=95; Se=80, Sp=50; and Se=99, Sp=99, with a reference line at PPV = 50%.]


Positive Predictive Value

[Figure: PPV (%) plotted against prevalence (0 to 2%) for Se=30, Sp=99; Se=50, Sp=95; Se=80, Sp=50; and Se=99, Sp=99, with a reference line at PPV = 50%.]


Area Under the Receiver-Operating Characteristic Curve

[Figure: Receiver-operating characteristic curve. Sensitivity (%) plotted against 1-Specificity (%) for the model and a random classifier. AUC = 0.86. Data from Simon et al., 2018.]

Advances in Predictive Models

∎ Enhanced computing capabilities

∎ Machine-learning algorithms

∎ Intensive validation techniques


Application in the Literature

∎ Systematic literature review of suicide mortality and suicide attempt prediction models

∎ 17 prospective studies included

∎ AUC values were considered ‘good’ across models at 0.80 or above

∎ Positive predictive values were very low (<1%) in most instances.

Driven in large part by low base rate

∎ Risk predicted over set time horizons (e.g., 30 days, 90 days, 3 months, 1 year)


Simulation

∎ Used estimates from the literature of sensitivity and risk threshold to simulate results in different population configurations

Population risk = 200, 500, 1000, and 2000 per 1,000,000 individuals (200 per 1,000,000 is proximal to US adult population annual suicide mortality rate [WISQARS])

Thresholds = 99th, 95th, 90th, and 50th percentile

Sensitivity means = 0.12, 0.23, 0.44, 0.82 corresponding to thresholds above
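The configuration above can be turned into expected counts with a few lines of arithmetic. This is a sketch under stated assumptions, not the simulation code used for the presentation: it assumes that a threshold at the k-th percentile flags the top (100 − k)% of a population of 1,000,000, and it uses the mean sensitivities directly rather than drawing from a distribution. The function name `expected_counts` is hypothetical.

```python
def expected_counts(base_rate_per_million: float, percentile: float,
                    sensitivity: float, n: int = 1_000_000) -> dict:
    """Expected confusion-matrix cells for one population/threshold pairing.

    Assumption: a threshold at the k-th percentile flags (100 - k)% of people.
    """
    cases = n * base_rate_per_million / 1_000_000
    flagged = n * (100 - percentile) / 100
    tp = sensitivity * cases      # cases above the threshold
    fn = cases - tp               # cases missed by the model
    fp = flagged - tp             # flagged people without the outcome
    tn = n - cases - fp
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "ppv": tp / flagged,
        "flagged_per_tp": flagged / tp,  # people flagged per true positive
    }


# Thresholds and mean sensitivities as listed on the slide.
thresholds = [(99, 0.12), (95, 0.23), (90, 0.44), (50, 0.82)]
for base_rate in (200, 500, 1000, 2000):
    for pct, se in thresholds:
        r = expected_counts(base_rate, pct, se)
        print(f"rate={base_rate:>4}/1M  threshold={pct}th  TP={r['tp']:.0f}  "
              f"FN={r['fn']:.0f}  PPV={100 * r['ppv']:.2f}%")
```

Lowering the threshold raises the number of true positives but also multiplies the number of people flagged, which is why PPV stays small across these configurations.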


Results

[Figure: Individuals (0 to 3,000) by risk threshold (99th, 95th, 90th, 50th percentile), showing True Positive, False Negative, and No. Needed. Base rate = 200 per 1M.]

Results

[Figure: Individuals (0 to 3,000) by risk threshold (99th, 95th, 90th, 50th percentile), showing True Positive, False Negative, and No. Needed. Base rate = 1000 per 1M.]

Results

[Figure, two panels. AUC panel: Sensitivity (%) plotted against 1-Specificity (%) for the model and a random classifier. Precision-Recall panel (Saito & Rehmsmeier, 2015): Precision (PPV; %) plotted against Recall (Sensitivity; %) for the model and a random classifier. Data from Simon et al., 2018.]

Results

[Figure: Precision-recall curve with the precision axis rescaled. Precision (PPV; %, 0 to 1) plotted against Recall (Sensitivity; %, 0 to 100) for the model and the prevalence line. AUC = 0.005.]

Results

∎ Two-stage simulation didn’t improve performance dramatically

∎ Populations with higher base rate performed better

Argues against whole-population implementation

∎ AUC estimates provided overly positive assessment of model accuracy

∎ Models can be effective as an exclusionary measure (good negative predictive value), but not inclusionary (Streiner, 2003)


Clinical Utility

∎ Clinical utility (CU) index (Mitchell, 2011)

𝐶𝑈+ = 𝑆𝑒 × 𝑃𝑃𝑉

𝐶𝑈− = 𝑆𝑝 × 𝑁𝑃𝑉

Values <0.49 (49%) subjectively considered not useful

∎ Decision curve analysis (Steyerberg et al., 2010)

Compares various courses of action to identify best choice (net benefit)

Varies by conditional risk threshold
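As a rough illustration of the clinical utility index, the sketch below computes CU+ and CU− from confusion-matrix counts. The counts are an assumption derived from the simulation settings given earlier (base rate of 200 per 1M, 99th percentile threshold taken to flag 1% of 1,000,000 people, sensitivity 0.12); the function name `clinical_utility` is hypothetical.

```python
def clinical_utility(tp: float, fp: float, fn: float, tn: float) -> tuple:
    """Return (CU+, CU-) as proportions: Se x PPV and Sp x NPV."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return se * ppv, sp * npv


# Assumed counts: 200 cases per 1M, top 1% flagged (10,000 people), Se = 0.12.
cu_pos, cu_neg = clinical_utility(tp=24, fp=9_976, fn=176, tn=989_824)
print(f"CU+ = {100 * cu_pos:.2f}%   CU- = {100 * cu_neg:.2f}%")
```

Under these assumptions CU+ is about 0.03% and CU− about 98.98%, far below and far above the 49% cut-off respectively, which illustrates a model that is useful for ruling out but not for ruling in.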


Clinical Utility

Prevalence (per 1M)   Threshold (percentile)   CU+ (%)   CU- (%)
200                   99th                     0.03      98.98
200                   95th                     0.02      94.99
200                   90th                     0.04      90.00
200                   50th                     0.03      50.00
1000                  99th                     0.14      98.92
1000                  95th                     0.11      94.94
1000                  90th                     0.20      89.98
1000                  50th                     0.14      50.01


Decision Curve Analysis

$NB = \frac{TP}{N} - \frac{FP}{N} \times \frac{p_t}{1 - p_t}$

∎ Assume 200 per 1M population risk and 95th percentile risk threshold

∎ Options

Treat no one

NB = 0

Treat everyone

Treat those identified by the model
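The three options can be compared with the standard net-benefit expression from decision curve analysis: true positives per person minus false positives per person weighted by the odds of the risk threshold. The sketch below is illustrative only. It assumes the 95th percentile threshold flags 5% of 1,000,000 people and uses the mean sensitivity of 0.23 from the simulation settings; the risk-threshold values in the loop are arbitrary choices.

```python
def net_benefit(tp: float, fp: float, n: float, p_t: float) -> float:
    """Net benefit per person at risk threshold p_t (a probability)."""
    return tp / n - (fp / n) * (p_t / (1.0 - p_t))


n = 1_000_000
cases = 200                   # 200 per 1M population risk
flagged = 0.05 * n            # assumption: 95th percentile flags the top 5%
tp_model = 0.23 * cases       # mean sensitivity at the 95th percentile
fp_model = flagged - tp_model

for p_t in (0.0002, 0.001, 0.002):   # illustrative risk thresholds
    nb_none = 0.0                                          # treat no one
    nb_all = net_benefit(cases, n - cases, n, p_t)         # treat everyone
    nb_model = net_benefit(tp_model, fp_model, n, p_t)     # treat model positives
    print(f"p_t={p_t:.4f}  none={nb_none * n:8.1f}  all={nb_all * n:8.1f}  "
          f"model={nb_model * n:8.1f}  (per 1M individuals)")
```

Multiplying by `n` expresses net benefit per 1M individuals; a negative value means the option does worse than treating no one at that threshold.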


Decision Curve Analysis

[Figure: Net benefit (per 1M individuals, -300 to 100) plotted against risk threshold (0 to 10,000 per 1M individuals) for Se=.10, Sp=.99; Se=.25, Sp=.95; Se=.44, Sp=.90; Se=.82, Sp=.50; Se=.50, Sp=.95; No Intervention; and All Intervention. Annotation: 2000 per 1M = 500 individuals per positive case.]

Ways Ahead

∎ Consider modeling in subsets with higher base rate

∎ Improve description of accuracy

∎ Consideration of interventions post positive screening

How many false positives are we willing to tolerate?

How effective is any intervention?

What is the resource burden?

Opportunity costs?


References

Belsher, B., Smolenski, D., Pruitt, L., Bush, N., Beech, E. H., Workman, D., Morgan, R., . . . Skopp, N. (2019). Prediction models for suicide attempts and deaths: A systematic review and simulation. JAMA Psychiatry, 76(6), 642-651. doi:10.1001/jamapsychiatry.2019.0174

Pruitt, L., Smolenski, D., Tucker, J., Issa, F., Chodacki, J., McGraw, K., & Kennedy, C. (2019). DoDSER: Department of Defense Suicide Event Report Calendar Year 2017 Annual Report. Retrieved from https://www.pdhealth.mil/research-analytics/department-defense-suicide-event-report-dodser.

MacKinnon, D. R., & Farberow, N. L. (1976). An assessment of the utility of suicide prediction. Suicide and life-threatening behavior, 6(2), 86-92.

Mitchell, A. J. (2011). Sensitivity x PPV is a recognized test called the clinical utility index (CUI+). European Journal of Epidemiology, 26, 251-252. doi:10.1007/s10654-011-9561-x

Mossman, D. (2000). Assessing the risk of violence: Are "accurate" predictions useful? Journal of the American Academy of Psychiatry and the Law, 28, 272-281.

Pokorny, A. D. (1983). Prediction of suicide in psychiatric patients. Archives of General Psychiatry, 40, 249-257.


References

Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS One, 10(3). doi:10.1371/journal.pone.0118432

Simon, G. E., Johnson, E. J., Lawrence, J. M., Rossom, R. C., Ahmedani, R., Lynch, F. L., . . . Shortreed, S. M. (2018). Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. American Journal of Psychiatry, 175(10), 951-960. doi:10.1176/appi.ajp.2018.17101167

Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., . . . Kattan, M. W. (2010). Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology, 21(1), 128-138. doi:10.1097/EDE.0b013e3181c30fb2

Streiner, D. L. (2003). Diagnosing tests: using and misusing diagnostic and screening tests. Journal of Personality Assessment, 81(3), 209-219.

Undrill, G. (2007). The risks of risk assessment. Advances in Psychiatric Treatment, 13, 291-297. doi:10.1192/apt.bp.106.003160

Vickers, A. J. (2008). Decision analysis for the evaluation of diagnostic tests, prediction models and molecular markers. American Statistician, 62(4), 314-320. doi:10.1198/000313008X370302


How to Earn CE

If you would like to earn continuing education credit for this activity, please visit:

http://amsus.cds.pesgce.com

Hurry, CE Certificates will only be available for 30 Days after this event!

