OHDSI CDM presentation
TRANSCRIPT
Transforming the 2.33M-patient Medicare synthetic public use files to the OMOP CDM v5:
ETL-CMS software and processed data available and feature-complete
Christophe G. Lambert, PhD1, Praveen Kumar2, Amritansh2
1Center for Global Health, Division of Translational Informatics, Dept. of Internal Medicine. 2Dept. of Computer Science. University of New Mexico, Albuquerque, NM.
Overview
• The need for an open dataset
• The Medicare DE-SynPUF data
• The ETL-CMS project
• Overview of the extract-transform-load process
• Loading the data into an OMOP CDM v5 database
• Atlas views of the DE-SynPUF data
The need for an open dataset
• Most EHR/claims databases are not free or open
– Student use of licensed or PHI data often disallowed
– IRB review hurdles for access
• Until now, open datasets have been small
• OHDSI tools are inaccessible without OMOP CDM data!
Benefits of an open dataset
• Freely accessible to anyone interested in observational research
• No data privacy concerns
• Can serve as a testbed for methods
– Everyone can have the same data (reproducibility)
– Data quality verified by Achilles Heel
• Makes it easier for new users to try the OHDSI tools
The Medicare DE-SynPUF data
• Data Entrepreneurs’ Synthetic Public Use File
• 2.33M synthetic patients based on real Medicare claims data
– Years covered: 2008–2010
– Similar format to the real claims data obtainable from http://www.resdac.org
– Contains drugs, procedures, visits, conditions, providers, costs, deaths, and patient demographics
The ETL-CMS project
• Project initiated Feb. 2015 to convert the DE-SynPUF data to OMOP CDM v5
– Started by the CMS working group of the OHDSI community
– Repository: https://github.com/OHDSI/ETL-CMS
– Python-based partial implementation
• UNM researchers resumed the work Dec. 2015–June 2016
– ~6 person-months to complete the ETL
– Detailed documentation available
– Data download available: ftp://ftp.ohdsi.org/synpuf
GitHub repository
Documentation
Improvements made
• All OMOP CDM v5 database tables now populated (to the extent the SynPUF data allows)
– visit_occurrence, payer_plan_period, location, care_site, etc.
– Empty tables: device_cost, specimen, visit_cost, note
• No data in the output CDM v5 CSV files violates the defined database constraints
• All 20 SynPUF parts consistently reference shared information (provider, care_site, location, etc.)
Improvements made
• Improved logic for concept mapping
– Deprecated concepts
– Handling of one-to-many mappings
– Unmapped concepts
• Input data sorting is consistent across platforms
• A log file is created for input records with undefined ICD9/HCPCS/NDC codes
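The concept-mapping flow described above can be sketched as follows. This is an illustrative fragment, not the actual ETL-CMS code: the lookup dictionary stands in for the vocabulary tables, and the concept IDs shown are for illustration only.

```python
import logging

# Stand-in for a lookup built from the CONCEPT / CONCEPT_RELATIONSHIP
# vocabulary tables (contents illustrative, not authoritative).
CONCEPT_MAP = {
    "250.00": [201826],           # ICD9 code mapped to one standard concept
    "401.9": [320128],            # ICD9 code mapped to one standard concept
    "V72.0": [1111111, 2222222],  # one-to-many mapping (placeholder IDs)
}

unmapped_log = logging.getLogger("etl.unmapped")

def map_source_code(code):
    """Return the standard concept_ids for a source ICD9/HCPCS/NDC code.

    One-to-many mappings yield one output row per target concept; codes
    with no mapping are logged and assigned concept_id 0 (the OMOP
    convention for 'no matching concept').
    """
    targets = CONCEPT_MAP.get(code)
    if not targets:
        unmapped_log.warning("undefined source code: %s", code)
        return [0]
    return targets
```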
Caveats
• The output data has limits on its inferential research value
– Synthetic data derived from real data
– Modifications from the real data are undocumented
• Trade-offs made for certain transformations:
– Visit dates (drugs not assigned to visits)
– Observation periods (defined by earliest and latest event)
– Payer plan periods (complex; see documentation)
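The observation-period trade-off above means each person's period simply spans their earliest through latest recorded event. A minimal sketch of that derivation (illustrative; the actual ETL-CMS implementation differs in detail):

```python
from datetime import date

def observation_period(event_dates):
    """Derive a single observation period for one person, spanning from
    the earliest to the latest recorded event date."""
    if not event_dates:
        return None  # no events: no observation period row is emitted
    return (min(event_dates), max(event_dates))
```

For example, a person with events on 2009-06-15, 2008-03-01, and 2010-11-30 gets a single period from 2008-03-01 to 2010-11-30.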
Caveats
• Input DE-SynPUF records with undefined ICD9/HCPCS/NDC codes are not processed
– Some appear to be typos
– Some appear to be real but non-standard codes (e.g., 04.22)
• 6% of drug_exposure records have a quantity and days_supply of 0
– The derived dose_era table is therefore left empty
• Location information uses SSA codes
– Converted to 2-letter state codes (not to spec)
– All non-states lumped into code “54”
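The SSA-to-state conversion above can be sketched like this. Only a handful of SSA codes are shown here (an illustrative subset, not the full lookup used by the ETL):

```python
# Illustrative subset of the SSA state code -> USPS abbreviation lookup.
SSA_TO_STATE = {
    "01": "AL",
    "05": "CA",
    "33": "NY",
    "45": "TX",
}

def ssa_to_state(ssa_code):
    """Convert an SSA state code to a 2-letter state abbreviation;
    anything not recognized as a state is lumped into code "54"."""
    return SSA_TO_STATE.get(ssa_code, "54")
```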
Running the extract-transform-load (ETL) process
0. Shortcut: download the ready-to-go data and vocabulary files (step 3) and skip to step 7
1. Install required software
2. Download the SynPUF input data
3. Download the CDMv5 Vocabulary files
4. Set up the .env file to specify file locations
5. Test the ETL with the DE_0 CMS test data
6. Run the ETL on the CMS data
7. Load the data into the database
8. Create the ERA tables
9. Open issues and caveats with the ETL
0. Download the ready-to-go data
• ftp://ftp.ohdsi.org/synpuf
• ~18GB download of compressed .csv files
– synpuf_1.zip: tables for the first 1/20th of the data
– The remaining files are individually zipped tables for the full 2.33M patients
• Retrieve and unzip
– synpuf_1.zip tables: tablename_1.csv
– Full ETL tables: named after the table names
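The file-naming convention above can be captured in a small helper (a trivial sketch of the pattern described on the slide, not part of the ETL-CMS tooling):

```python
def csv_name(table, subset=False):
    """Expected CSV filename for a CDM table: the synpuf_1.zip 1/20th
    subset uses tablename_1.csv; the full ETL output uses tablename.csv."""
    return f"{table}_1.csv" if subset else f"{table}.csv"
```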
3. Download CDMv5 Vocabulary files
• Download vocabulary files from http://www.ohdsi.org/web/athena/
– Select, at minimum, the following vocabularies: SNOMED, ICD9CM, ICD9Proc, CPT4, HCPCS, LOINC, RxNorm, and NDC
– Can take several hours to download
• Unzip the files into a directory
• Add CPT4 concepts
Add CPT4 concepts
• CPT4 concepts have to be fetched separately
• Run: java -jar cpt4.jar 5 (can take hours)
• Concepts are appended to CONCEPT.csv
Edit SQL files
• SQL files are located in the ETL-CMS/SQL folder
• Replace the synpuf5 schema with the target schema name in all .sql files
• Set the path to the data location in load_CDMv5_synpuf.sql
– Note: synpuf_1.zip (the 1/20th subset) has filenames tablename_1.csv
• Set the path to the vocabulary location in load_CDMv5_vocabulary.sql
COPY synpuf5.CARE_SITE FROM '/home/lambert/CMS/care_site_1.csv' WITH DELIMITER E',' CSV HEADER QUOTE E'\b';
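The schema substitution can be scripted rather than done by hand. A minimal sketch, assuming the files use the literal prefix synpuf5. as in the COPY line above (this helper is not part of the ETL-CMS tooling):

```python
from pathlib import Path

def replace_schema_text(sql_text, new_schema, old_schema="synpuf5"):
    """Replace the schema prefix (e.g. 'synpuf5.') with the target schema."""
    return sql_text.replace(f"{old_schema}.", f"{new_schema}.")

def replace_schema_dir(sql_dir, new_schema):
    """Rewrite every .sql file under sql_dir in place."""
    for path in Path(sql_dir).glob("*.sql"):
        path.write_text(replace_schema_text(path.read_text(), new_schema))
```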
Create database
> psql -f create_CDMv5_tables.sql
CREATE TABLE synpuf5.observation_period (
    observation_period_id INTEGER NOT NULL,
    person_id INTEGER NOT NULL,
    observation_period_start_date DATE NOT NULL,
    observation_period_end_date DATE NOT NULL,
    period_type_concept_id INTEGER NOT NULL
);
Load data
> psql -f load_CDMv5_synpuf.sql
(loads all the SynPUF data CSV files)
> psql -f load_CDMv5_vocabulary.sql
(loads all the standardized vocabulary CSV files)
Finalize load
Create constraints:
> psql -f create_CDMv5_constraints.sql
Create indices:
> psql -f create_CDMv5_indices.sql
Create eras:
> psql -f create_CDMv5_condition_era.sql (~4 hrs)
> psql -f create_CDMv5_drug_era_non_stockpile.sql (~3 hrs)
Achilles
• Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems
– https://github.com/OHDSI/Achilles
– Characterization, quality assessment, and visualization of observational health databases
– Assesses patient demographics and the prevalence of conditions, drugs, and procedures
– Provides patient-level anonymity
Achilles setup
• Follow the Achilles instructions
– https://github.com/OHDSI/Achilles/blob/master/README.md
• Run the Achilles analysis via the Achilles R package on the CDMv5 database
– Run Achilles() with the connection string, database name, schema name, vocabulary, and port
– It generates the analyses and stores them in the results schema
• Export the analysis results to JSON format for use by AchillesWeb
– Run ExportToJSON() with the path for the output JSON files
• Host it on a web server or in conjunction with Atlas
Running Atlas with the synthetic data
• ATLAS is an open-source web-based interface to a (growing) subset of the OHDSI tools: https://github.com/OHDSI/Atlas
• ATLAS is developed in HTML, CSS, and JavaScript and can be deployed on a local web server
– Update the config.js file to point to the current active OHDSI WebAPI deployment
Atlas views of the DE-SynPUF data
Dashboard
Achilles Heel report
Observation Periods
• Age at First Observation
• Age by Gender
• Observation Length
• Duration by Gender
• Cumulative Observation
• Duration by Age Decile
Measurement prevalence treemap
Concept sets in Atlas
Cohort builder in Atlas
Acknowledgements
• Recent University of New Mexico contributors
– Praveen Kumar @Praveen_Kumar, Department of Computer Science
– Amritansh @Amritansh, Department of Computer Science
• Past contributors
– Don O'Hara @donohara, Evidera
– Ryan Duryea @aguynamedryan, Outcomes Insights, Inc.
– Jennifer Duryea @jenniferduryea, Outcomes Insights, Inc.
– Claire Cangialose @claire-oi, Outcomes Insights, Inc.
– Erica Voss @ericaVoss, Janssen Research and Development
– Patrick Ryan @Patrick_Ryan, Janssen Research and Development
• Christian Reich for help with the OMOP vocabulary
• Chris Knoll and Anthony Sena for help with Atlas configuration
• All of the contributors to the growing OHDSI software ecosystem