Discovering De Facto Diagnosis Specialties
ACM BCB 2015
Xun Lu1*, Aston Zhang1*, Carl A. Gunter1, Daniel Fabbri2, David Liebovitz3, Bradley Malin2
1University of Illinois at Urbana-Champaign, 2Vanderbilt University, 3Northwestern University
Presented by Aston Zhang | Sep 11, 2015
*Equal contributors
Medical specialties provide information about which providers have the skills needed to carry out key procedures or make critical judgments
However, organizing specialties into departments or wards has limitations:
• Some specialties may be lacking or inaccurate (they are not always entered for new hire documents)
• Employees can change roles
• Encoded departments do not always align with specialties
As a result, there could be a gap between diagnosis histories of certain providers and their specialties
Medical specialties are useful, but could be inconsistent with diagnosis histories
Providers select from the Health Care Provider Taxonomy Code Set (HPTCS) when they apply for their National Provider Identifiers (NPI)
However, providers may not always choose their taxonomy codes based on the certifications they hold
• The National Plan & Provider Enumeration System does not verify the selected taxonomy code
• Certain taxonomy codes do not correspond to any nationwide certifications that are approved by a professional board (e.g., Men and Masculinity)
• Some national certifications are not reflected by the taxonomy code list
National Provider Identifiers (NPI) are not always accurate
As we have seen, there are limitations in purely relying on NPI taxonomy codes
Hence, we propose to leverage real-world diagnosis histories to infer and recognize actual specialties
• De facto specialties are medical specialties that exist in practice regardless of the specialty codes (NPI taxonomy codes)
De facto diagnosis specialties are medical specialties that exist in practice and are highly predictable by the diagnoses inherent in the EHRs
We leverage diagnosis histories to infer and recognize actual specialties (de facto specialties)
Urology is an example of a diagnosis specialty, as opposed to anesthesiology
• It should be easier to characterize a urologist in terms of medical diagnoses for conditions, for example, of the kidney, ureter, and bladder
• It should be harder to characterize an anesthesiologist, whose duties are more cross-cutting with respect to diagnoses, concerning essentially all conditions related to surgeries
De facto diagnosis specialties are highly predictable by the diagnoses inherent in EHRs
There is no ground truth to determine the validity of a discovered de facto diagnosis specialty
A discovered de facto diagnosis specialty can be recognized by classifiers as accurately as the existing listed diagnosis specialties
We discover de facto diagnosis specialties that do not have corresponding codes in HPTCS
The users (providers) can be likened to readers of documents, where there is an archive of documents in which the words in each document correspond to diagnoses
Users who share a de facto diagnosis specialty form a group of readers with a common interest
To solve the de facto diagnosis specialty discovery problem, we aim to develop a classifier that characterizes this common interest in terms of the documents that the group has read
We can think of users as readers of documents whose words are diagnoses
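The reader-document analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `access_log`, `patient_diagnoses`, and `build_user_documents` are hypothetical, and the toy records are invented for the example; the slides do not show the real data layout.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs (assumption: the real log format is not given in the slides).
access_log = [  # (user_id, patient_id) access events
    ("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u2", "p3"), ("u1", "p1"),
]
patient_diagnoses = {  # patient_id -> diagnoses listed in the discharge record
    "p1": ["kidney stone", "urinary tract infection"],
    "p2": ["kidney stone"],
    "p3": ["breast cancer"],
}


def build_user_documents(access_log, patient_diagnoses):
    """Return user_id -> Counter of diagnoses ("words") reached via accessed patients."""
    docs = defaultdict(Counter)
    for user, patient in access_log:
        # Each access adds the patient's diagnoses to the user's "document".
        docs[user].update(patient_diagnoses.get(patient, []))
    return dict(docs)


user_docs = build_user_documents(access_log, patient_diagnoses)
print(user_docs["u1"])
```

Each user ends up with a bag of diagnoses, which is the form that topic models and TF-IDF weighting expect.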
We use access log data from a hospital and combine it with the diagnosis lists in patient discharge records
Fine-grained data set
• A small portion of the data has an explicit mapping between users and diagnoses of the EHRs they accessed
General data set (more representative of the challenging scenarios encountered in practice)
• The entire data after removal of all the fine-grained mapping information
We study two data sets from Northwestern Memorial Hospital
A few taxonomy codes account for the majority of specialists in the data sets
The ICD-9 codes for diagnoses are mapped down to 603 Clinical Classifications Software (CCS) codes
NPI taxonomy codes with fewer than 20 user instances are filtered out
Based on the guidance of clinicians and hospital administrators, we further identify 12 NPI taxonomy codes as diagnosis specialties (core NPI taxonomy codes):
• Obstetrics & Gynecology, Cardiovascular Disease, Neurology, Ophthalmology, Gastroenterology, Dermatology, Orthopaedic Surgery, Neonatal-Perinatal Medicine, Infectious Disease, Pulmonary Disease, Neurological Surgery, and Urology
We identify 12 NPI taxonomy codes as diagnosis specialties
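The filtering step for rare taxonomy codes can be sketched as follows. The function name `filter_rare_codes` and the toy mapping are hypothetical; only the threshold of 20 user instances comes from the slide.

```python
from collections import Counter

MIN_USERS = 20  # threshold stated on the slide


def filter_rare_codes(user_to_code, min_users=MIN_USERS):
    """Keep only users whose taxonomy code has at least `min_users` user instances."""
    counts = Counter(user_to_code.values())
    return {user: code for user, code in user_to_code.items()
            if counts[code] >= min_users}


# Hypothetical toy data: 25 users with one code, 3 users with a rare one.
user_to_code = {f"u{i}": "Urology" for i in range(25)}
user_to_code.update({f"v{i}": "Rare Specialty" for i in range(3)})

kept = filter_rare_codes(user_to_code)
print(len(kept), set(kept.values()))
```

The ICD-9 to CCS mapping is not shown here because it depends on a lookup table that the slides do not reproduce.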
We invoke machine learning to discover potential de facto diagnosis specialties in the data set that lack corresponding codes in the HPTCS
• A semi-supervised learning model (PathSelClus) for the fine-grained data set
• An unsupervised learning model (LDA) for the larger general data set
We use supervised learning models to evaluate the recognition accuracy of the discovered specialty by comparing our approach with the existing listed diagnosis specialties (12 core NPI taxonomy codes)
• Ideally, their recognition accuracy should be similar
• Such recognition accuracy is evaluated by four classifiers: decision trees, random forests, PCA-KNN, and SVM
We solve the problem under a general discovery-evaluation framework
A heterogeneous information network consists of multiple types of objects and/or multiple types of links.
Link-based clustering in heterogeneous information networks groups objects based on their connections to other objects in the networks
Two meta-paths in our model:
• User (access) -> Patient (accessed by) -> User
• User (access) -> Patient (diagnosed with) -> Diagnosis (assigned to) -> Patient (accessed by) -> User
PathSelClus is used for discovery in the fine-grained data set
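To make the two meta-paths concrete, the sketch below counts path instances between pairs of users by multiplying adjacency matrices. It is not PathSelClus itself, which additionally learns meta-path weights and cluster assignments; the matrices `access` and `diagnosed` are hypothetical toy data.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical toy network.
# access[u][p] = 1 if user u accessed patient p (users x patients)
access = [[1, 1, 0],
          [0, 1, 1]]
# diagnosed[p][d] = 1 if patient p has diagnosis d (patients x diagnoses)
diagnosed = [[1, 0],
             [1, 0],
             [0, 1]]

# Meta-path User -> Patient -> User: number of shared patients.
upu = matmul(access, transpose(access))

# Meta-path User -> Patient -> Diagnosis -> Patient -> User:
# number of path instances connecting two users through a common diagnosis.
user_diag = matmul(access, diagnosed)
updpu = matmul(user_diag, transpose(user_diag))

print(upu)    # [[2, 1], [1, 2]]
print(updpu)  # [[4, 2], [2, 2]]
```

The second meta-path links users who never accessed the same patient but whose patients share diagnoses, which is why it carries more signal about specialties than co-access alone.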
PathSelClus provides semi-supervised learning on heterogeneous information networks
In practice, fine-grained data sets may not be available for PathSelClus. Hence, we also employ LDA, an unsupervised learning method based on topic modeling
In LDA, topics act as summaries of different themes pervasive in the corpus and documents are characterized with respect to these topics
We associate users with diagnoses via the patients they access
Latent Dirichlet Allocation (LDA) is used for discovery in the general data set
After applying LDA, each user is assigned to an allocation in the specialty topic simplex
A higher frequency in a specialty indicates that the user is more likely to access patients with diagnoses popular in that specialty
We cluster users by their closest specialty, i.e., the specialty with the highest proportion in the user's allocation in the specialty topic simplex
LDA provides unsupervised clustering of users
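The cluster assignment after LDA amounts to an argmax over each user's topic proportions. The sketch below assumes the proportions have already been inferred by some LDA implementation; `theta`, `topic_names`, and `assign_specialty` are hypothetical names with invented values.

```python
def assign_specialty(topic_proportions, topic_names):
    """Return the specialty topic with the highest proportion for one user."""
    best = max(range(len(topic_proportions)), key=topic_proportions.__getitem__)
    return topic_names[best]


# Hypothetical output of LDA: one point in the topic simplex per user.
topic_names = ["Urology", "Breast Cancer", "Obesity"]
theta = {
    "u1": [0.70, 0.20, 0.10],
    "u2": [0.05, 0.80, 0.15],
    "u3": [0.30, 0.25, 0.45],
}

clusters = {user: assign_specialty(p, topic_names) for user, p in theta.items()}
print(clusters)
```

Ties are resolved in favor of the first topic, which is an arbitrary choice of this sketch.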
Four classifiers are used for evaluation
Features: we map each user to a term frequency-inverse document frequency (TF-IDF) weighted diagnosis vector
• TF: number of times that a user has accessed patients with a diagnosis
• IDF: inverse of the number of users that have accessed patients with a diagnosis
Classifiers:
• Decision trees
• Random forests
• PCA-KNN
• SVM
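A minimal sketch of the TF-IDF features follows. The slide only says "inverse" for IDF; the logarithmic form log(N / df) used here is the common textbook variant and is an assumption, as are the function name `tfidf_vectors` and the toy counts.

```python
import math


def tfidf_vectors(user_docs):
    """Map each user to a TF-IDF weighted diagnosis vector.

    user_docs: user_id -> {diagnosis: access count} (the TF values).
    IDF is log(N / df), an assumed variant; the slides do not give the formula.
    """
    n_users = len(user_docs)
    df = {}  # diagnosis -> number of users who accessed patients with it
    for counts in user_docs.values():
        for diagnosis in counts:
            df[diagnosis] = df.get(diagnosis, 0) + 1
    vocab = sorted(df)
    vectors = {
        user: [counts.get(d, 0) * math.log(n_users / df[d]) for d in vocab]
        for user, counts in user_docs.items()
    }
    return vocab, vectors


# Hypothetical toy counts.
toy_docs = {
    "u1": {"kidney stone": 3, "uti": 2},
    "u2": {"kidney stone": 1, "breast cancer": 1},
}
vocab, vectors = tfidf_vectors(toy_docs)
print(vocab)
print(vectors["u1"])
```

A diagnosis seen by every user gets weight zero, so only diagnoses that separate users contribute to the vectors passed to the classifiers.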
The de facto diagnosis specialty Breast Cancer is discovered by PathSelClus
It is represented by the top 10 most accessed diagnoses by all the users that are associated with the Breast Cancer specialty
De facto diagnosis specialties Breast Cancer and Obesity are discovered by LDA
They are represented by 10 most probable diagnoses respectively as an output of LDA
The recognition accuracy of the discovered de facto diagnosis specialty and the ones listed in HPTCS are similar
Evaluation of the Breast Cancer specialty discovered by PathSelClus
Evaluation of the Breast Cancer specialty discovered by LDA
Evaluation of the Obesity specialty discovered by LDA
• P: Precision, R: Recall, F1: F1 Score
• All values are in percentage
• Boldfaced results indicate significant improvement (5x2 cross-validation & paired t-test with p < 0.05)
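For reference, the three reported metrics can be computed per specialty as below. The labels are invented toy data, and `precision_recall_f1` is a hypothetical helper; the significance test from the slides is not reproduced here.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (P, R, F1) in percent for one specialty treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 100 * precision, 100 * recall, 100 * f1


# Hypothetical true and predicted specialties for five users.
y_true = ["Breast Cancer", "Breast Cancer", "Urology", "Urology", "Breast Cancer"]
y_pred = ["Breast Cancer", "Urology", "Urology", "Breast Cancer", "Breast Cancer"]

p, r, f1 = precision_recall_f1(y_true, y_pred, positive="Breast Cancer")
print(round(p, 2), round(r, 2), round(f1, 2))
```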
In conclusion, de facto diagnosis specialties can be discovered systematically
Medical specialties are useful, but are often inconsistent with actual diagnosis histories
• Even National Provider Identifiers are not always accurate
Machine learning can be leveraged to discover and evaluate de facto diagnosis specialties, such as Breast Cancer and Obesity
• Semi-supervised and unsupervised learning are used for discovery
• Supervised learning is used for evaluation