OHDSI CDM presentation
TRANSCRIPT
Transforming the 2.33M-patient Medicare synthetic public use files to the OMOP CDM v5:
ETL-CMS software and processed data available and feature-complete
Christophe G. Lambert, PhD1, Praveen Kumar2, Amritansh2
1Center for Global Health, Division of Translational Informatics, Dept. of Internal Medicine. 2Dept. of Computer Science. University of New Mexico, Albuquerque, NM.
Overview
• The need for an open dataset
• The Medicare DE-SynPUF data
• The ETL-CMS project
• Overview of the extract-transform-load process
• Loading the data into an OMOP CDM v5 database
• Atlas views of the DE-SynPUF data
The need for an open dataset
• Most EHR/claims databases are not free or open
– Student use of licensed or PHI data often disallowed
– IRB review hurdles for access
• Until now, open datasets have been small
• OHDSI tools are inaccessible without OMOP CDM data!
Benefits of an open dataset
• Freely accessible to anyone interested in observational research
• No data privacy concerns
• Can serve as a testbed for methods
– Everyone can have the same data (reproducibility)
– Data quality verified by Achilles Heel
• Makes it easier for new users to try the OHDSI tools
The Medicare DE-SynPUF data
• Data Entrepreneurs’ Synthetic Public Use File
• 2.33M synthetic patients based on real Medicare claims data
– Years covered: 2008–2010
– Similar format to the real claims data obtainable from http://www.resdac.org
– Contains drugs, procedures, visits, conditions, providers, costs, deaths, and patient demographics
The ETL-CMS project
• Project initiated Feb. 2015 to convert the DE-SynPUF data to OMOP CDM v5
– Started by the CMS working group of the OHDSI community
– Repository: https://github.com/OHDSI/ETL-CMS
– Python-based partial implementation
• UNM researchers resumed the work Dec. 2015–June 2016
– ~6 person-months to complete the ETL
– Detailed documentation available
– Data download available: ftp://ftp.ohdsi.org/synpuf
GitHub repository
Documentation
Improvements made
• All OMOP CDM v5 database tables now populated (to the extent the SynPUF data allows)
– visit_occurrence, payer_plan_period, location, care_site, etc.
– Empty tables: device_cost, specimen, visit_cost, note
• No data in the output CDM v5 CSV files violates the defined database constraints
• All 20 SynPUF parts consistently reference shared information (provider, care_site, location, etc.)
Improvements made
• Improved logic for concept mapping
– Deprecated concepts
– Handling of one-to-many mappings
– Unmapped concepts
• Input data sorting is consistent across platforms
• A log file is created for input records with undefined ICD9/HCPCS/NDC codes
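The concept-mapping flow described above can be sketched as follows. This is an illustrative fragment, not the actual ETL-CMS code: the lookup dictionary stands in for the vocabulary tables, and the concept IDs shown are for illustration only.

```python
import logging

# Stand-in for a lookup built from the CONCEPT / CONCEPT_RELATIONSHIP
# vocabulary tables (contents illustrative, not authoritative).
CONCEPT_MAP = {
    "250.00": [201826],           # ICD9 code mapped to one standard concept
    "401.9": [320128],            # ICD9 code mapped to one standard concept
    "V72.0": [1111111, 2222222],  # one-to-many mapping (placeholder IDs)
}

unmapped_log = logging.getLogger("etl.unmapped")

def map_source_code(code):
    """Return the standard concept_ids for a source ICD9/HCPCS/NDC code.

    One-to-many mappings yield one output row per target concept; codes
    with no mapping are logged and assigned concept_id 0 (the OMOP
    convention for 'no matching concept').
    """
    targets = CONCEPT_MAP.get(code)
    if not targets:
        unmapped_log.warning("undefined source code: %s", code)
        return [0]
    return targets
```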
Caveats
• The output data has limits on its inferential research value
– Synthetic data derived from real data
– Modifications from the real data are undocumented
• Trade-offs made for certain transformations:
– Visit dates (drugs not assigned to visits)
– Observation periods (defined by earliest and latest event)
– Payer plan periods (complex; see documentation)
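The observation-period trade-off above means each person's period simply spans their earliest through latest recorded event. A minimal sketch of that derivation (illustrative; the actual ETL-CMS implementation differs in detail):

```python
from datetime import date

def observation_period(event_dates):
    """Derive a single observation period for one person, spanning from
    the earliest to the latest recorded event date."""
    if not event_dates:
        return None  # no events: no observation period row is emitted
    return (min(event_dates), max(event_dates))
```

For example, a person with events on 2009-06-15, 2008-03-01, and 2010-11-30 gets a single period from 2008-03-01 to 2010-11-30.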
Caveats
• Input DE-SynPUF records with undefined ICD9/HCPCS/NDC codes are not processed
– Some appear to be typos
– Some appear to be real but non-standard codes (e.g., 04.22)
• 6% of drug_exposure records have a quantity and days_supply of 0
– The derived dose_era table is therefore left empty
• Location information uses SSA codes
– Converted to 2-letter state codes (not to spec)
– All non-states lumped into code “54”
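The SSA-to-state conversion above can be sketched like this. Only a handful of SSA codes are shown here (an illustrative subset, not the full lookup used by the ETL):

```python
# Illustrative subset of the SSA state code -> USPS abbreviation lookup.
SSA_TO_STATE = {
    "01": "AL",
    "05": "CA",
    "33": "NY",
    "45": "TX",
}

def ssa_to_state(ssa_code):
    """Convert an SSA state code to a 2-letter state abbreviation;
    anything not recognized as a state is lumped into code "54"."""
    return SSA_TO_STATE.get(ssa_code, "54")
```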
Running the extract-transform-load (ETL) process
0. Shortcut: download the ready-to-go data and vocabulary files (step 3) and skip to step 7
1. Install required software
2. Download the SynPUF input data
3. Download the CDMv5 Vocabulary files
4. Set up the .env file to specify file locations
5. Test the ETL with the DE_0 CMS test data
6. Run the ETL on the CMS data
7. Load the data into the database
8. Create the ERA tables
9. Open issues and caveats with the ETL
0. Download the ready-to-go data
• ftp://ftp.ohdsi.org/synpuf
• ~18GB download of compressed .csv files
– synpuf_1.zip: tables for the first 1/20th of the data
– The remaining files are individually zipped tables for the full 2.33M patients
• Retrieve and unzip
– synpuf_1.zip tables: tablename_1.csv
– Full ETL tables: named after the table names
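The file-naming convention above can be captured in a small helper (a trivial sketch of the pattern described on the slide, not part of the ETL-CMS tooling):

```python
def csv_name(table, subset=False):
    """Expected CSV filename for a CDM table: the synpuf_1.zip 1/20th
    subset uses tablename_1.csv; the full ETL output uses tablename.csv."""
    return f"{table}_1.csv" if subset else f"{table}.csv"
```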
3. Download CDMv5 Vocabulary files
• Download vocabulary files from http://www.ohdsi.org/web/athena/
– Select, at minimum, the following vocabularies: SNOMED, ICD9CM, ICD9Proc, CPT4, HCPCS, LOINC, RxNorm, and NDC
– Can take several hours to download
• Unzip the files into a directory
• Add CPT4 concepts
Add CPT4 concepts
• CPT4 concepts have to be fetched separately
• Run: java -jar cpt4.jar 5 (can take hours)
• Concepts are appended to CONCEPT.csv
Edit SQL files
• SQL files are located in the ETL-CMS/SQL folder
• Replace the synpuf5 schema with the target schema name in all .sql files
• Set the path to the data location in load_CDMv5_synpuf.sql
– Note: synpuf_1.zip (the 1/20th subset) has filenames tablename_1.csv
• Set the path to the vocabulary location in load_CDMv5_vocabulary.sql
COPY synpuf5.CARE_SITE FROM '/home/lambert/CMS/care_site_1.csv' WITH DELIMITER E',' CSV HEADER QUOTE E'\b';
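The schema substitution can be scripted rather than done by hand. A minimal sketch, assuming the files use the literal prefix synpuf5. as in the COPY line above (this helper is not part of the ETL-CMS tooling):

```python
from pathlib import Path

def replace_schema_text(sql_text, new_schema, old_schema="synpuf5"):
    """Replace the schema prefix (e.g. 'synpuf5.') with the target schema."""
    return sql_text.replace(f"{old_schema}.", f"{new_schema}.")

def replace_schema_dir(sql_dir, new_schema):
    """Rewrite every .sql file under sql_dir in place."""
    for path in Path(sql_dir).glob("*.sql"):
        path.write_text(replace_schema_text(path.read_text(), new_schema))
```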
Create database
> psql -f create_CDMv5_tables.sql
CREATE TABLE synpuf5.observation_period (
    observation_period_id INTEGER NOT NULL,
    person_id INTEGER NOT NULL,
    observation_period_start_date DATE NOT NULL,
    observation_period_end_date DATE NOT NULL,
    period_type_concept_id INTEGER NOT NULL
);
Load data
> psql -f load_CDMv5_synpuf.sql
(loads all the SynPUF data CSV files)
> psql -f load_CDMv5_vocabulary.sql
(loads all the standardized vocabulary CSV files)
Finalize load
Create constraints:
> psql -f create_CDMv5_constraints.sql
Create indices:
> psql -f create_CDMv5_indices.sql
Create eras:
> psql -f create_CDMv5_condition_era.sql (~4 hrs)
> psql -f create_CDMv5_drug_era_non_stockpile.sql (~3 hrs)
Achilles
• Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems
– https://github.com/OHDSI/Achilles
– Characterization, quality assessment, and visualization of observational health databases
– Assesses patient demographics and the prevalence of conditions, drugs, and procedures
– Provides patient-level anonymity
Achilles setup
• Follow the Achilles instructions
– https://github.com/OHDSI/Achilles/blob/master/README.md
• Run the Achilles analysis via the Achilles R package on the CDMv5 database
– Run Achilles() with the connection string, database name, schema name, vocabulary, and port
– It generates the analyses and stores them in the results schema
• Export the analysis results to JSON format for use by AchillesWeb
– Run ExportToJSON() with the path for the output JSON files
• Host it on a web server or in conjunction with Atlas
Running Atlas with the synthetic data
• ATLAS is an open-source web-based interface to a (growing) subset of the OHDSI tools: https://github.com/OHDSI/Atlas
• ATLAS is developed in HTML, CSS, and JavaScript and can be deployed on a local web server
– Update the config.js file to point to the current active OHDSI WebAPI deployment
Atlas views of the DE-SynPUF data
Dashboard
Achilles Heel report
Observation Periods
• Age at First Observation
• Age by Gender
• Observation Length
• Duration by Gender
• Cumulative Observation
• Duration by Age Decile
Measurement prevalence treemap
Concept sets in Atlas
Cohort builder in Atlas
Acknowledgements
• Recent University of New Mexico contributors
– Praveen Kumar @Praveen_Kumar, Department of Computer Science
– Amritansh @Amritansh, Department of Computer Science
• Past contributors
– Don O'Hara @donohara, Evidera
– Ryan Duryea @aguynamedryan, Outcomes Insights, Inc.
– Jennifer Duryea @jenniferduryea, Outcomes Insights, Inc.
– Claire Cangialose @claire-oi, Outcomes Insights, Inc.
– Erica Voss @ericaVoss, Janssen Research and Development
– Patrick Ryan @Patrick_Ryan, Janssen Research and Development
• Christian Reich for help with the OMOP vocabulary
• Chris Knoll and Anthony Sena for help with Atlas configuration
• All of the contributors to the growing OHDSI software ecosystem