4.08 million patients’ health-care claim records over [2005 to 2013]


Upload: cindy

Post on 25-Feb-2016


DESCRIPTION

Early Prediction of Type II Diabetes from Administrative Medical Records. Narges Razavian, Rahul G. Krishnan, David Sontag. Courant Institute of Mathematical Sciences, New York University, New York City.

TRANSCRIPT

Page 1: 4.08 million patients’ health-care claim records over [2005 to 2013]

Early Prediction of Type II Diabetes from Administrative Medical Records

Narges Razavian, Rahul G. Krishnan, David Sontag
Courant Institute of Mathematical Sciences, New York University, New York City

Project Goals

Prediction and analysis of disease trajectories in patients for:
• Personalized disease intervention discovery
• New medical insight in disease mechanisms
• Population policy design
• In this poster: Early Prediction of Type II Diabetes

Data

• 4.08 million patients’ health-care claim records over [2005 to 2013]
• Socio-economic data (1.9 million patients)
• Eligibility records
• Medical/encounter claim data
• Lab tests
• Medication prescriptions

Methodology

• Data Representation: features from patient records up to time T
• Diabetes Label: whether the patient has diabetes onset between T and T+W
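The feature and label definitions above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the event codes, the `make_example` helper, the choice of 1 January 2011 for T, the half-open window [T, T+W), and the exclusion of patients whose onset precedes T are all hypothetical choices that the poster does not specify.

```python
from collections import Counter
from datetime import date


def add_months(d, months):
    # Shift a date by whole months; the day is clamped to 28 so the result is always valid.
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def make_example(events, onset, T, W_months):
    """Build (features, label) for one patient.

    events: list of (date, code) pairs from the patient's records (codes are hypothetical).
    onset:  date of diabetes onset, or None if never observed.
    Returns None when onset precedes T (assumption: such patients are excluded).
    """
    if onset is not None and onset < T:
        return None
    window_end = add_months(T, W_months)
    # Features: counts of each code observed strictly before T.
    features = Counter(code for when, code in events if when < T)
    # Label: 1 if onset falls in [T, T+W), else 0 (boundary handling is an assumption).
    label = int(onset is not None and T <= onset < window_end)
    return features, label


T = date(2011, 1, 1)  # assumed start of 2011; the poster only says T=2011
events = [
    (date(2010, 5, 1), "lab:glucose"),
    (date(2010, 9, 1), "lab:glucose"),
    (date(2012, 3, 1), "rx:metformin"),  # after T, so it must not leak into the features
]
features, label = make_example(events, date(2012, 6, 1), T, 24)
```

Counting only events before T keeps information from the prediction window out of the features, which is the point of separating "records up to time T" from "onset between T and T+W".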

Models

• L1-Regularized Logistic Regression
• Decision Tree
• Gradient Boosted Decision Tree

Current Parameters

• T = 2011, W = 24 months
• Training set: 437K cases, 4% positive
• Validation set: 237K cases, 4% positive
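To illustrate why L1 regularization is useful with a very wide feature set, here is a minimal pure-Python sketch of L1-regularized logistic regression fitted by proximal gradient descent. It is not the authors' implementation; the synthetic data, the penalty strength `lam`, the step size `lr`, and the iteration count are all assumptions chosen for the toy example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of t * |w|: shrink w toward zero by t, clipping at zero.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, n_iter=300):
    """Minimize mean log-loss + lam * ||w||_1 by proximal gradient descent.

    The intercept b is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step on the smooth loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Synthetic data (an assumption for illustration): only feature 0 drives the label.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    x = [rng.gauss(0, 1) for _ in range(5)]
    X.append(x)
    y.append(1 if rng.random() < sigmoid(2.0 * x[0] - 1.0) else 0)

w, b = fit_l1_logistic(X, y)
selected = [j for j, wj in enumerate(w) if wj != 0.0]  # features that survive the penalty
```

The soft-thresholding step sets small weights exactly to zero, so the surviving indices in `selected` act as a feature-selection result that could be passed on to another model.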

Features

[Figure: 1000x8 LAB Features]

Experimental Results I

L1-regularized Logistic Regression

Validation Set Area Under the Curve:
• Baseline (22 risk factors used in medical literature): 0.709
• Extensive Features (33435 features): 0.751
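For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are made-up toy values, not data from the poster.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. This pairwise form is quadratic in the number of
    cases, so it suits small examples rather than a 237K-case validation set.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the 0.709 and 0.751 values above.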

Experimental Results II

• Focusing on sensitivity for patients with highest predicted probability of developing diabetes
• Nonlinear Models: Decision Trees and Gradient Boosted Decision Trees

Models trained on patients with P_logit(diabetes=1) > 0.57; Validation Set AUC (30K patients, 10% positive):
• Best L1-regularized Logistic Regression Model: 0.5903
• Best Decision Tree Model (feature selection via L1-regularized Logit): 0.6125
• Best Gradient Boosted Decision Tree Model (feature selection via L1-regularized Logit): 0.6322
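The high-risk cohort used in this second experiment can be pictured as a simple filter on the first-stage logistic regression score. The sketch below is a hypothetical illustration: the record layout (`p_logit`, `label`) and the toy patients are assumptions, and only the 0.57 threshold comes from the poster.

```python
def high_risk_subset(patients, threshold=0.57):
    # Keep patients whose first-stage predicted probability exceeds the threshold.
    return [p for p in patients if p["p_logit"] > threshold]


def positive_rate(patients):
    # Fraction of patients in the group who developed diabetes.
    return sum(p["label"] for p in patients) / len(patients)


# Toy records (made up): p_logit is the first-stage score, label the observed outcome.
patients = [
    {"p_logit": 0.90, "label": 1},
    {"p_logit": 0.60, "label": 0},
    {"p_logit": 0.57, "label": 1},  # exactly at the threshold, so excluded
    {"p_logit": 0.20, "label": 0},
    {"p_logit": 0.10, "label": 0},
    {"p_logit": 0.05, "label": 0},
]
subset = high_risk_subset(patients)
overall_rate = positive_rate(patients)
subset_rate = positive_rate(subset)
```

Filtering on the score raises the share of positives in the group, consistent with the poster's cohort being 10% positive versus 4% in the full validation set. Because everyone in the subset already scores high, ranking within it is harder, which may be why the AUC values in this experiment are lower than those in Experimental Results I.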

Discussions and Future Work

• Features from medical records improve the prediction accuracy compared to existing risk factors in literature
• Nonlinear models improve prediction specificity for high-risk patients
• Nontrivial interventional features discovered suggest further causal inference analysis