OHDSI CDM Presentation

Transforming the 2.33M-patient Medicare synthetic public use files to the OMOP CDM v5: ETL-CMS software and processed data available and feature-complete. Christophe G. Lambert, PhD¹, Praveen Kumar², Amritansh². ¹Center for Global Health, Division of Translational Informatics, Dept. of Internal Medicine. ²Dept. of Computer Science. University of New Mexico, Albuquerque, NM.

Post on 17-Jan-2017

TRANSCRIPT

Page 1: OHDSI CDM Presentation

Transforming the 2.33M-patient Medicare synthetic public use files to the OMOP CDM v5:

ETL-CMS software and processed data available and feature-complete

Christophe G. Lambert, PhD¹, Praveen Kumar², Amritansh²

¹Center for Global Health, Division of Translational Informatics, Dept. of Internal Medicine. ²Dept. of Computer Science. University of New Mexico, Albuquerque, NM.

Page 2: OHDSI CDM Presentation

Overview
• The need for an open dataset
• The Medicare DE-SynPUF data
• The ETL-CMS project
• Overview of the extract-transform-load process
• Loading the data into an OMOP CDMv5 database
• Atlas views of the DE-SynPUF data

Page 3: OHDSI CDM Presentation

The need for an open dataset
• Most EHR/claims databases are not free or open
  – Student use of licensed or PHI data often disallowed
  – IRB review hurdles for access
• Until now, open datasets have been small
• OHDSI tools are inaccessible without OMOP CDM data!

Page 4: OHDSI CDM Presentation

Benefits of an open dataset
• Freely accessible to anyone interested in observational research
• No data privacy concerns
• Can serve as a testbed for methods
  – Everyone can have the same data (reproducibility)
  – Data quality verified by Achilles Heel
• Makes it easier for new users to try the OHDSI tools

Page 5: OHDSI CDM Presentation

The Medicare DE-SynPUF data
• Data Entrepreneurs’ Synthetic Public Use File
• 2.33M synthetic patients based on real Medicare claims data
  – Years covered: 2008-2010
  – Similar format to real claims data obtainable from http://www.resdac.org
  – Contains drugs, procedures, visits, conditions, providers, costs, deaths, and patient demographics

Page 6: OHDSI CDM Presentation

The ETL-CMS project
• Project initiated Feb. 2015 to convert DE-SynPUF data to OMOP CDM v5
  – Started by the CMS working group of the OHDSI community
  – Repository: https://github.com/OHDSI/ETL-CMS
  – Python-based partial implementation
• UNM researchers resumed the work, Dec. 2015 to June 2016
  – ~6 person-months to complete the ETL
  – Detailed documentation available
  – Data download available: ftp://ftp.ohdsi.org/synpuf

Page 7: OHDSI CDM Presentation

GitHub repository

Page 8: OHDSI CDM Presentation

Documentation

Page 9: OHDSI CDM Presentation

Improvements made
• All OMOP CDMv5 database tables now populated (to the extent the SynPUF data allows)
  – visit_occurrence, payer_plan_period, location, care_site, etc.
  – Empty tables: device_cost, specimen, visit_cost, note
• No data in the output CDMv5 csv files violates the defined database constraints
• All 20 SynPUF parts consistently reference shared information (provider, care_site, location, etc.)
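As an illustration of the "no constraint violations" claim, the sketch below checks an output csv file for empty values in columns declared NOT NULL. It is not the project's own validation code; the column list is taken from the observation_period DDL shown on the "Create database" slide, and the function name, sample rows, and concept id are made up for this example.

```python
import csv
import io

# NOT NULL columns of observation_period, per the CDMv5 DDL shown in this deck.
REQUIRED_COLUMNS = [
    "observation_period_id",
    "person_id",
    "observation_period_start_date",
    "observation_period_end_date",
    "period_type_concept_id",
]


def find_null_violations(csv_file, required=REQUIRED_COLUMNS):
    """Return (line_number, column) pairs where a required value is empty."""
    violations = []
    reader = csv.DictReader(csv_file)
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        for column in required:
            value = row.get(column)
            if value is None or value.strip() == "":
                violations.append((line_number, column))
    return violations


# Hypothetical sample data: the second data row is missing person_id.
sample = io.StringIO(
    "observation_period_id,person_id,observation_period_start_date,"
    "observation_period_end_date,period_type_concept_id\n"
    "1,101,2008-01-01,2010-12-31,12345\n"
    "2,,2008-03-15,2009-06-30,12345\n"
)
print(find_null_violations(sample))
```

A check like this only covers NOT NULL constraints; key and foreign-key constraints are enforced by the database when the constraint script is run.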

Page 10: OHDSI CDM Presentation

Improvements made
• Improved logic for concept mapping
  – Deprecated concepts
  – One-to-many mappings
  – Unmapped concepts
• Input data sorting consistent across platforms
• A log file is created listing the input records with undefined ICD9/HCPCS/NDC codes
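The three mapping cases and the log of undefined codes can be pictured with a small sketch. This is an assumption-laden illustration, not the ETL-CMS implementation: the tuple layout, function names, and all codes and concept ids are hypothetical. The only OMOP convention relied on is that concept_id 0 denotes "no matching concept".

```python
from collections import defaultdict


def build_lookup(mappings):
    """Build source_code -> [target_concept_id, ...], skipping deprecated rows.

    `mappings` is an iterable of (source_code, target_concept_id, invalid_reason);
    a non-empty invalid_reason marks a deprecated mapping (assumed layout).
    """
    lookup = defaultdict(list)
    for source_code, target_concept_id, invalid_reason in mappings:
        if invalid_reason:
            continue  # deprecated: do not map to it
        lookup[source_code].append(target_concept_id)
    return lookup


def map_code(source_code, lookup, known_codes, undefined_log):
    """Return the target concept ids for one source code.

    Undefined codes are logged and yield no output records; defined codes
    without a valid mapping fall back to concept_id 0 (unmapped).
    """
    if source_code not in known_codes:
        undefined_log.append(source_code)
        return []
    return lookup.get(source_code) or [0]


# Hypothetical codes and concept ids, for illustration only.
mappings = [("A1", 1001, ""), ("A1", 1002, ""), ("B2", 2001, "D")]
known_codes = {"A1", "B2"}
lookup = build_lookup(mappings)
log = []
print(map_code("A1", lookup, known_codes, log))  # one-to-many mapping
print(map_code("B2", lookup, known_codes, log))  # only a deprecated mapping
print(map_code("ZZ", lookup, known_codes, log))  # undefined code
print(log)
```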

Page 11: OHDSI CDM Presentation

Caveats
• The output data has limits on its inferential research value
  – Synthetic data derived from real data
  – Modifications from real data are undocumented
• Trade-offs made for certain transformations:
  – Visit dates (drugs not assigned to visits)
  – Observation periods (defined by earliest and latest event)
  – Payer plan periods (complex, see documentation)
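The observation-period trade-off ("defined by earliest and latest event") amounts to a min/max over each patient's event dates. A minimal sketch, assuming events arrive as (person_id, date) pairs; the function name and sample data are hypothetical and the real ETL may differ in detail.

```python
from datetime import date


def observation_periods(events):
    """Map each person_id to (earliest_event_date, latest_event_date)."""
    periods = {}
    for person_id, event_date in events:
        if person_id in periods:
            start, end = periods[person_id]
            periods[person_id] = (min(start, event_date), max(end, event_date))
        else:
            periods[person_id] = (event_date, event_date)
    return periods


# Hypothetical events for two synthetic patients.
events = [
    (1, date(2008, 5, 2)),
    (1, date(2010, 11, 30)),
    (2, date(2009, 1, 15)),
    (1, date(2008, 1, 7)),
]
print(observation_periods(events))
```

One consequence of this definition is that a patient with a single event gets an observation period of one day.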

Page 12: OHDSI CDM Presentation

Caveats
• Input DE-SynPUF records with undefined ICD9/HCPCS/NDC codes are not processed
  – Some appear to be typos
  – Some appear to be real but non-standard codes (e.g., 04.22)
• 6% of drug_exposure records have a quantity and days_supply of 0
  – Derived dose_era table therefore left empty
• Location information uses SSA codes
  – Converted to 2-letter state codes (not to spec)
  – All non-states lumped into code “54”
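The location caveat can be sketched as a lookup with a catch-all. The table below is a three-entry excerpt for illustration only (a complete SSA state table would be needed in practice), and the function name is hypothetical; this is not the ETL's own code.

```python
# Illustrative excerpt of SSA state codes; NOT a complete table.
SSA_TO_STATE = {
    "01": "AL",
    "05": "CA",
    "33": "NY",
}

NON_STATE_CODE = "54"  # catch-all used for everything that is not a state


def ssa_to_state(ssa_code, table=SSA_TO_STATE):
    """Return the 2-letter state for an SSA code, or "54" when there is none."""
    key = str(ssa_code).strip().zfill(2)  # tolerate 5 vs "05"
    return table.get(key, NON_STATE_CODE)


print(ssa_to_state("05"))
print(ssa_to_state(33))
print(ssa_to_state("99"))
```

Because the excerpt is incomplete, any real state missing from the table would also fall into "54" here; a full table avoids that.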

Page 13: OHDSI CDM Presentation

Running the extract-transform-load (ETL) process
0. Shortcut: download the ready-to-go data and vocabulary files (step 3) and skip to step 7
1. Install required software
2. Download SynPUF input data
3. Download CDMv5 Vocabulary files
4. Set up the .env file to specify file locations
5. Test ETL with DE_0 CMS test data
6. Run ETL on CMS data
7. Load data into the database
8. Create ERA tables
9. Open issues and caveats with the ETL

Page 14: OHDSI CDM Presentation

0. Download the ready-to-go data
• ftp://ftp.ohdsi.org/synpuf
• ~18GB download of compressed .csv files
  – synpuf_1.zip: tables for the first 1/20th of the data
  – Remaining files are individually zipped tables for the full 2.33M patients
• Retrieve and unzip
  – synpuf_1.zip tables: tablename_1.csv
  – Full ETL tables: named after table names

Page 15: OHDSI CDM Presentation

3. Download CDMv5 Vocabulary files
• Download vocabulary files from http://www.ohdsi.org/web/athena/
  – Select, at minimum, the following vocabularies: SNOMED, ICD9CM, ICD9Proc, CPT4, HCPCS, LOINC, RxNorm, and NDC
  – Can take several hours to download
• Unzip the files in a directory
• Add CPT4 concepts

Page 16: OHDSI CDM Presentation

Add CPT4 concepts
• CPT4 concepts have to be fetched separately
• Run: java -jar cpt4.jar 5 (can take hours)
• Concepts appended to CONCEPT.csv
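After the CPT4 step finishes, one way to confirm that concepts were appended is to count CONCEPT.csv rows per vocabulary. The sketch assumes the Athena export layout (tab-delimited, with a lower-case vocabulary_id header column); the function name and the sample rows are made up for the example.

```python
import csv
import io
from collections import Counter


def count_by_vocabulary(concept_file):
    """Count rows per vocabulary_id in a tab-delimited CONCEPT file."""
    reader = csv.DictReader(concept_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["vocabulary_id"] for row in reader)


# Hypothetical three-row stand-in for CONCEPT.csv.
sample = io.StringIO(
    "concept_id\tconcept_name\tvocabulary_id\n"
    "1\tExample concept A\tSNOMED\n"
    "2\tExample concept B\tCPT4\n"
    "3\tExample concept C\tCPT4\n"
)
counts = count_by_vocabulary(sample)
print(counts["CPT4"], counts["SNOMED"])
```

On the real file, pass an open file handle, for example `open("CONCEPT.csv", newline="", encoding="utf-8")`; a CPT4 count of zero would suggest the step did not complete.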

Page 17: OHDSI CDM Presentation

Edit SQL files
• SQL files located in ETL-CMS/SQL folder
• Replace synpuf5 schema with target name in all .sql files
• Set path to data location in: load_CDMv5_synpuf.sql
  – Note: synpuf_1.zip (1/20th subset) has filenames: tablename_1.csv
• Set path to vocabulary location in: load_CDMv5_vocabulary.sql

COPY synpuf5.CARE_SITE FROM '/home/lambert/CMS/care_site_1.csv' WITH DELIMITER E',' CSV HEADER QUOTE E'\b';
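The schema rename across all .sql files can be scripted rather than done by hand. The sketch below is one possible approach, not part of ETL-CMS: the function names and the target schema "cdm" are hypothetical. It replaces the whole word synpuf5 wherever it occurs, so a data path that happens to contain that word would be rewritten too.

```python
import re
from pathlib import Path


def retarget_schema(sql_text, new_schema, old_schema="synpuf5"):
    """Replace whole-word occurrences of old_schema with new_schema."""
    pattern = re.compile(r"\b" + re.escape(old_schema) + r"\b")
    # A function replacement avoids backslash interpretation in new_schema.
    return pattern.sub(lambda match: new_schema, sql_text)


def retarget_directory(sql_dir, new_schema, old_schema="synpuf5"):
    """Rewrite every *.sql file in sql_dir in place; return the changed names."""
    changed = []
    for path in sorted(Path(sql_dir).glob("*.sql")):
        original = path.read_text()
        updated = retarget_schema(original, new_schema, old_schema)
        if updated != original:
            path.write_text(updated)
            changed.append(path.name)
    return changed


line = (r"COPY synpuf5.CARE_SITE FROM '/home/lambert/CMS/care_site_1.csv' "
        r"WITH DELIMITER E',' CSV HEADER QUOTE E'\b';")
print(retarget_schema(line, "cdm"))
```

Running `retarget_directory("ETL-CMS/SQL", "cdm")` would rewrite the files in place, so it is worth working on a copy or a clean git checkout.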

Page 18: OHDSI CDM Presentation

Create database
> psql -f create_CDMv5_tables.sql

CREATE TABLE synpuf5.observation_period (
    observation_period_id          INTEGER NOT NULL,
    person_id                      INTEGER NOT NULL,
    observation_period_start_date  DATE    NOT NULL,
    observation_period_end_date    DATE    NOT NULL,
    period_type_concept_id         INTEGER NOT NULL
);

Page 19: OHDSI CDM Presentation

Load data
> psql -f load_CDMv5_synpuf.sql
(loads all the synpuf data csv files)

> psql -f load_CDMv5_vocabulary.sql
(loads all the standardized vocabulary csv files)

Page 20: OHDSI CDM Presentation

Finalize load
Create constraints:
> psql -f create_CDMv5_constraints.sql
Create indices:
> psql -f create_CDMv5_indices.sql
Create eras:
> psql -f create_CDMv5_condition_era.sql (~4hrs)
> psql -f create_CDMv5_drug_era_non_stockpile.sql (~3hrs)
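The psql steps from the last three slides can be chained in a small driver so that a failure stops the sequence. This is a sketch under assumptions: the script order simply follows the slides, the function names and the optional database argument are hypothetical, and `-v ON_ERROR_STOP=1` is added so psql exits with an error when a statement fails.

```python
import os
import subprocess

# Script order as presented on the slides: create, load, constrain, index, eras.
LOAD_ORDER = [
    "create_CDMv5_tables.sql",
    "load_CDMv5_synpuf.sql",
    "load_CDMv5_vocabulary.sql",
    "create_CDMv5_constraints.sql",
    "create_CDMv5_indices.sql",
    "create_CDMv5_condition_era.sql",
    "create_CDMv5_drug_era_non_stockpile.sql",
]


def build_commands(sql_dir, database=None):
    """Return one psql argument list per script, in load order."""
    commands = []
    for script in LOAD_ORDER:
        command = ["psql", "-v", "ON_ERROR_STOP=1"]
        if database:
            command += ["-d", database]
        command += ["-f", os.path.join(sql_dir, script)]
        commands.append(command)
    return commands


def run_all(commands):
    """Run the commands in sequence, raising on the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


for command in build_commands("ETL-CMS/SQL"):
    print(" ".join(command))
```

Calling `run_all(build_commands("ETL-CMS/SQL"))` would execute the whole sequence; given the era scripts alone take hours, running it under a terminal multiplexer or as a background job is sensible.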

Page 21: OHDSI CDM Presentation

Achilles
• Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems
  – https://github.com/OHDSI/Achilles
  – Characterization, quality assessment and visualization of observational health databases
  – Assesses patient demographics and the prevalence of conditions, drugs and procedures
  – Provides patient-level anonymity

Page 22: OHDSI CDM Presentation

Achilles setup

• Follow the Achilles instructions
  – https://github.com/OHDSI/Achilles/blob/master/README.md
• Run the Achilles analysis from the Achilles R package on the CDMv5 database
  – Run achilles() with the connection string, database_name, schema_name, vocabulary, and port to be used
  – It generates the analyses and stores them in the results schema
• Export the analysis results to JSON for use by AchillesWeb
  – Run exportToJson() with the path for the output JSON files
• Host it on a web server or in conjunction with Atlas

Page 23: OHDSI CDM Presentation

Running Atlas with the synthetic data

• ATLAS is an open-source, web-based interface to a (growing) subset of the OHDSI tools: https://github.com/OHDSI/Atlas
• ATLAS is developed using HTML, CSS and JavaScript and can be deployed on a local web server
  – Update the config.js file to point to the currently active OHDSI WebAPI deployment

Page 24: OHDSI CDM Presentation

Atlas views of the DE-SynPUF data

Dashboard

Page 25: OHDSI CDM Presentation

Achilles Heel report

Page 26: OHDSI CDM Presentation

Observation Periods

• Age at First Observation
• Age by Gender
• Observation Length
• Duration by Gender
• Cumulative Observation
• Duration by Age decile

Page 27: OHDSI CDM Presentation

Measurement prevalence treemap

Page 28: OHDSI CDM Presentation

Concept sets in Atlas

Page 29: OHDSI CDM Presentation

Cohort builder in Atlas

Page 30: OHDSI CDM Presentation

Acknowledgements
• Recent University of New Mexico contributors
  – Praveen Kumar @Praveen_Kumar, Department of Computer Science
  – Amritansh @Amritansh, Department of Computer Science
• Past contributors
  – Don O'Hara @donohara, Evidera
  – Ryan Duryea @aguynamedryan, Outcomes Insights, Inc.
  – Jennifer Duryea @jenniferduryea, Outcomes Insights, Inc.
  – Claire Cangialose @claire-oi, Outcomes Insights, Inc.
  – Erica Voss @ericaVoss, Janssen Research and Development
  – Patrick Ryan @Patrick_Ryan, Janssen Research and Development
• Christian Reich for help with the OMOP vocabulary
• Chris Knoll and Anthony Sena for help with Atlas configuration
• All of the contributors to the growing OHDSI software ecosystem