Linking Mortality and Inpatient Discharge Records: Comparing Deterministic and Probabilistic Methodologies Richard Miller Office of Health Informatics Mike Yuan Bureau of Community Health Promotion Wisconsin Division of Public Health June 2011



Linking Mortality and Inpatient Discharge Records:

Comparing Deterministic and Probabilistic Methodologies

Richard Miller, Office of Health Informatics

Mike Yuan, Bureau of Community Health Promotion

Wisconsin Division of Public Health

June 2011


Linking (matching) Mortality Records and Inpatient Discharge Records

Why Combine Mortality Records and Inpatient Discharge Records?

How to link or match records
• Method 1: Deterministic record linkage
• Method 2: Probabilistic record linkage

How do the results compare?

Lessons learned


Why Combine Mortality Records and Inpatient Discharge Records?

Improve surveillance of CVD and other chronic diseases

Enhanced surveillance analysis opportunities

– Mortality records capture CVD only if it is recorded as an underlying or contributing cause of death

– Inpatient records capture CVD treated in that setting, but the case history ends at discharge

Capture hospital record information on demographics, co-morbidities, complications, and surgical procedures.

Measure treatment outcomes on a population basis


The Time Frame for Linked Records

Analyses are more complete the more time there is to find a death record following a hospitalization

The scale of mortality and inpatient records in Wisconsin:

2 million inpatient discharge records 2006-08
– Smaller number of individual patients

140,000 mortality records 2006-08

How to find matching records? How to define links between records?


False Positives and Negatives

Matching records involves finding a balance between false positive and false negative matches.

False positive matches combine records for different people.

False negatives fail to include all persons in the dataset of matched records – possibly introducing bias.


Method 1. Deterministic Record Linkage

Pairs of records are compared for exactly matching identifying information. Exact matches determine true record matches.

Works perfectly only if information that uniquely identifies the same individual in two datasets is available, is captured perfectly, and is recorded perfectly.

In real-world data systems:
– uniquely identifying elements are often not available;
– recorded data have small differences between records;
– some records have fields with missing values.


Method 2. Probabilistic Record Linkage

Every pair of records has some probability of being a “true match.”

Specialized software estimates that probability by applying statistical principles and tools.

Set some threshold for “high probability matches.”
• A common criterion is 0.9 probability of being a true match.
• This defines the risk of accepting false positives.

Some methods impute missing matches to pairs that look unlikely due to possible reporting and recording errors.
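
To make the threshold idea concrete, here is a minimal Python sketch. The record IDs and probabilities are invented for illustration, and the function names are hypothetical; in practice the probabilities would come from the linkage software. It accepts pairs at or above the 0.9 criterion and sums (1 - p) over the accepted pairs as a rough expected count of false positives.

```python
# Hypothetical candidate pairs: (inpatient_id, death_id, estimated match probability).
candidate_pairs = [
    ("P001", "D104", 0.99),
    ("P002", "D221", 0.95),
    ("P003", "D087", 0.62),
    ("P004", "D310", 0.91),
]

THRESHOLD = 0.9  # the common criterion mentioned on the slide


def accept_pairs(pairs, threshold):
    """Keep pairs whose estimated probability meets the threshold."""
    return [pair for pair in pairs if pair[2] >= threshold]


def expected_false_positives(accepted):
    """Sum of (1 - p): expected number of accepted pairs that are not true matches."""
    return sum(1.0 - prob for _, _, prob in accepted)


accepted = accept_pairs(candidate_pairs, THRESHOLD)
print(len(accepted), round(expected_false_positives(accepted), 2))
```

Raising the threshold lowers the expected false positives but drops more true matches, which is the balance described on the earlier slide.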


Part I. Deterministic Linkage among Inpatient Records

Identifying Patients = de-duplicating inpatient records

Method: Iterative application of combinations of elements with person-matching face validity.

Available fields:
• Initials
• 3-digit encryption of last name (Miller = M460)
• Date of birth
• Gender
• ZIP code of residence
• Insurance ID >> “SSN-like string”
• Hospital and medical record number
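
The example “Miller = M460” is consistent with the American Soundex code. The slides do not name the encoding, so treating it as Soundex is an assumption. Under that assumption, a minimal Python sketch of the encoding looks like this (the function name `soundex` is ours):

```python
# Letter-to-digit table for American Soundex; vowels, H, W and Y carry no code.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the first letter plus three digits, e.g. 'Miller' -> 'M460'."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        if ch not in "HW":  # H and W do not separate letters that share a code
            previous = code
    return (first + "".join(digits) + "000")[:4]


print(soundex("Miller"))
```

Because many surnames share one code, the encoded field identifies a person only in combination with other fields.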


Part I: Deterministic Linkage among Inpatient Records

Uniqueness of Patient Identifiers

Wisconsin Inpatients Discharged 2006-08, N=2,017,339

“Patient” Identifier                 % Records with identifier   % with unique values
Initials + DOB + sex                 100%                        56.2%
Initials + DOB + sex + ZIP           99.9%                       63.4%
Policy number + DOB + sex            92.1%                       64.7%
SSN-like string + DOB + sex          78.2%                       61.2%
Hospital + medical record number     99.7%                       70.9%


Part I: Deterministic Linkage among Inpatient Records

Record links were evaluated by looking for three indicators of false positive matches:

1. Any later admission date preceding the earliest admission’s discharge date.

2. Any admission date preceding the previous admission’s discharge date.

3. Records indicating the patient died but patient has later hospitalizations.
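
The three indicators above can be sketched as date checks over the stays linked to one presumed patient. This is a minimal illustration and not the authors' code: the `Stay` type and its field names are hypothetical, and same-day transfers (admission on the prior discharge date) are deliberately not flagged.

```python
from datetime import date
from typing import NamedTuple


class Stay(NamedTuple):
    admitted: date
    discharged: date
    died: bool  # discharge status recorded as expired


def false_positive_flags(stays):
    """Return the three warning indicators for one linked set of stays."""
    ordered = sorted(stays, key=lambda s: s.admitted)
    if not ordered:
        return {"overlaps_first": False, "overlaps_previous": False,
                "stay_after_death": False}
    first = ordered[0]
    # 1. A later admission begins before the earliest stay has ended.
    overlaps_first = any(s.admitted < first.discharged for s in ordered[1:])
    # 2. An admission begins before the previous stay has ended.
    overlaps_previous = any(
        later.admitted < earlier.discharged
        for earlier, later in zip(ordered, ordered[1:])
    )
    # 3. A stay ended in death but a later hospitalization exists.
    stay_after_death = any(
        earlier.died and later.admitted > earlier.discharged
        for i, earlier in enumerate(ordered)
        for later in ordered[i + 1:]
    )
    return {"overlaps_first": overlaps_first,
            "overlaps_previous": overlaps_previous,
            "stay_after_death": stay_after_death}


overlapping = [
    Stay(date(2006, 1, 1), date(2006, 1, 5), False),
    Stay(date(2006, 1, 3), date(2006, 1, 8), False),
]
print(false_positive_flags(overlapping))
```

A flagged set is a signal to review the link, not proof of an error; data entry mistakes in dates can trip the same checks.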


Part II: Deterministic Matching of Patients to Mortality Records

Matches between the 1,280,000 resident patients and the 135,000 Wisconsin occurrence deaths to residents.

Which inpatient record? The most recent one…

Iterative procedures use a succession of identifiers (combinations of the available data elements).
• Construct a linking identifier
• Select records with unique values of the “linker”
• Sort each set by that linking identifier
• Match and merge those records with identical linker values
• Collect the remaining records
• Construct an alternative linking combination
• Repeat until plausible linking combinations have been exhausted.
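
The iterative procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: records are plain dictionaries, the field names and sample values are invented, and a dictionary lookup stands in for the sort-and-merge step.

```python
from collections import Counter


def unique_index(records, fields):
    """Index records by linker value, keeping only complete, unique values."""
    keys = [tuple(rec.get(field, "") for field in fields) for rec in records]
    counts = Counter(keys)
    return {
        key: rec
        for key, rec in zip(keys, records)
        if counts[key] == 1 and all(key)  # drop duplicated or partly missing linkers
    }


def iterative_link(patients, deaths, linkers):
    """Apply each linker in turn to the records left unmatched so far."""
    pairs = []
    for fields in linkers:
        patient_index = unique_index(patients, fields)
        death_index = unique_index(deaths, fields)
        matched = [key for key in patient_index if key in death_index]
        for key in matched:
            pairs.append((patient_index[key]["id"], death_index[key]["id"], fields))
        matched_patients = {patient_index[key]["id"] for key in matched}
        matched_deaths = {death_index[key]["id"] for key in matched}
        patients = [p for p in patients if p["id"] not in matched_patients]
        deaths = [d for d in deaths if d["id"] not in matched_deaths]
    return pairs, patients, deaths


# Invented sample records; P2 and D2 differ only in ZIP, P3 and D3 lack an SSN.
patients = [
    {"id": "P1", "initials": "RM", "dob": "1940-02-03", "sex": "M", "zip": "53703", "ssn": "111223333"},
    {"id": "P2", "initials": "JS", "dob": "1935-07-19", "sex": "F", "zip": "53711", "ssn": "222334444"},
    {"id": "P3", "initials": "AB", "dob": "1950-01-01", "sex": "F", "zip": "53590", "ssn": ""},
]
deaths = [
    {"id": "D1", "initials": "RM", "dob": "1940-02-03", "sex": "M", "zip": "53703", "ssn": "111223333"},
    {"id": "D2", "initials": "JS", "dob": "1935-07-19", "sex": "F", "zip": "53719", "ssn": "222334444"},
    {"id": "D3", "initials": "KL", "dob": "1929-11-30", "sex": "M", "zip": "54301", "ssn": ""},
]
linkers = [("initials", "dob", "sex", "zip"), ("initials", "dob", "sex", "ssn")]

pairs, unmatched_patients, unmatched_deaths = iterative_link(patients, deaths, linkers)
print(pairs)
```

Ordering the linkers from most to least trusted matters, because a record matched in an early pass is never reconsidered in a later one.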


Part II: Deterministic Matching of Patients to Mortality Records

Iterative matching in two phases:

I. Match the records for in-hospital deaths
• Less time between events and more data elements in common
• Date of death = discharge date
• Hospital is a match element
• 25% of deaths; 2% of inpatients.

II. Examine the remaining records for matches


Part II: Deterministic Matching of Inpatient Records to Mortality Records

Phase I. Linked In-Hospital Deaths

                                          Matched Records          Remaining Unmatched Records
Linker                         # Pairs    % of        % of         # of         # of
                               Matched    inpatient   mortality    inpatient    mortality
                                          records     records      records      records
All In-Hospital Deaths                                             32,816       35,745
Initials + DOB + Sex + ZIP     26,022     79%         73%          6,794        9,723
Initials + DOB + Sex + SSN     2,666      8           7            4,128        7,057
Initials + Sex + ZIP3 + DOD    1,496      4           4            2,632        5,561
Hospital + DOD + DOB           833        2           2            1,799        4,728
Initials + Sex + DOB           37         0.1         0.1          1,762        4,691
All Linked Pairs               31,054     94.6%       86.9%


Part II: Deterministic Matching of Inpatient Records to Mortality Records

Phase II. Linked Residual Deaths and Patients

                                          Matched Records          Remaining Unmatched Records
Linker                         # Pairs    % of        % of         # of         # of
                               Matched    inpatient   mortality    inpatient    mortality
                                          records     records      records      records
Residual Deaths                                                    1,195,638    104,023
Initials + DOB + Sex + ZIP     53,059     4%          51%          1,142,579    50,964
Initials + DOB + Sex + SSN     5,514      <1          11           1,137,065    45,450
All Residual Linked Pairs      58,573     4.9%        56.3%


Part II: Deterministic Matching of Inpatient Records to Mortality Records

Combined results:

Linked 66% of the mortality records to a hospital patient
• 89,627 of the 135,077 total 2006-08 resident and occurrence deaths

Evaluated results with logic tests
• Admission date after previous discharge date
• Not hospitalized again after discharged ‘expired’
• Agreement rates among other data elements
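
The combined figure can be reproduced from the per-step pair counts in the Phase I and Phase II tables. The short Python check below uses only numbers reported on the slides; the variable names are ours.

```python
# Pairs matched at each deterministic step, as reported in the two tables.
phase1_steps = [26_022, 2_666, 1_496, 833, 37]  # in-hospital deaths
phase2_steps = [53_059, 5_514]                  # residual deaths
total_deaths = 135_077

phase1_pairs = sum(phase1_steps)
phase2_pairs = sum(phase2_steps)
linked = phase1_pairs + phase2_pairs
residual_deaths = total_deaths - phase1_pairs  # deaths entering Phase II

print(phase1_pairs, phase2_pairs, linked, f"{linked / total_deaths:.0%}")
```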


Part III: Probabilistic Matching of Inpatient Records to Mortality Records

A “probabilistic record linkage methodology” recognizes that a pair of records has some probability of being a “true match.”

Specialized software products estimate that probability:
• LinkSolv – our choice
• LinkPlus
• LinkPro

LinkSolv is based on Bayesian statistics as applied by Fellegi and Sunter and considerably developed by Dr. Michael McGlincy, the software developer.
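
For readers new to the Fellegi and Sunter approach, the following Python sketch shows the textbook form of the calculation: each field contributes a log-likelihood-ratio weight, and the summed weight updates a prior probability by Bayes' rule. It is not LinkSolv's implementation. The m- and u-probabilities (chance of agreement among true matches and among non-matches) and the prior are invented for illustration.

```python
import math

# Hypothetical (m, u) per field: P(agree | true match), P(agree | non-match).
FIELD_PARAMS = {
    "dob": (0.97, 0.0001),
    "sex": (0.99, 0.5),
    "zip3": (0.90, 0.05),
    "ssn4": (0.95, 0.0001),
}


def match_weight(agreements):
    """Sum log2 likelihood ratios; a value of None (missing) contributes nothing."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees is None:
            continue
        m, u = FIELD_PARAMS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight, prior):
    """Convert a total weight to a posterior probability via Bayes' rule on odds."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


all_agree = {"dob": True, "sex": True, "zip3": True, "ssn4": True}
zip_differs = {"dob": True, "sex": True, "zip3": False, "ssn4": True}

for label, pattern in [("all agree", all_agree), ("ZIP-3 differs", zip_differs)]:
    weight = match_weight(pattern)
    print(label, round(weight, 1), match_probability(weight, prior=0.01))
```

Summing weights assumes that fields agree or disagree independently, which is one of the assumptions the adjustments described in these slides are meant to relax.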


Part III: Probabilistic Matching of Inpatient Records to Mortality Records

LinkSolv compares pairs of fields, incorporating a number of adjustments to account for real-world violations of statistical assumptions:

• The probability that apparently different values may both be correct;
• Rates of missing data;
• Estimated rates of reporting errors; and
• Discounting some weights for matching/mismatching values if agreements/disagreements on one field are related to agreements/disagreements on another.

Comparisons may be for exact matches or acceptable differences.


Part III: Probabilistic Matching of Inpatient Records to Mortality Records

Some simplifying decisions:

Use the most recent inpatient discharge identified by the deterministic linkage process

Drop the 30% of patients who are mothers and their newborns

Work only with the patients whose last hospitalization was in 2006


Part III: Probabilistic Matching of Inpatient Records to Mortality Records

Experimented with comparison fields:

• Disaggregate birth date or not?
• Break up ZIP into ZIP-3 and ZIP-2 components or not?
• Break up name into separate initials and encrypted field?
• Use full SSN or just last 4 digits (SSN-4)?
• Use elements only available for the in-hospital deaths?


Part III: Probabilistic Matching of Inpatient Records to Mortality Records

Final model was relatively simple:

• Last initial + encryption (Miller = M460)
• First initial
• SSN-4
• Date of birth as one field
• Gender M/F
• ZIP-3
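
As an illustration of how the six comparison fields might be derived from a raw record, here is a small Python sketch. The input layout is an assumption: the key names are hypothetical, initials are assumed to be stored first-then-last, and the last-name code is assumed to be already present in the record.

```python
def comparison_fields(record):
    """Derive the six comparison fields from one raw record (hypothetical layout)."""
    ssn_digits = "".join(ch for ch in record.get("ssn", "") if ch.isdigit())
    zip_code = record.get("zip", "").strip()
    return {
        "last_name_code": record.get("last_name_code", "").upper(),
        "first_initial": record.get("initials", "")[:1].upper(),
        "ssn4": ssn_digits[-4:] if len(ssn_digits) >= 4 else "",
        "dob": record.get("dob", ""),  # kept as a single field
        "sex": record.get("sex", "").upper(),
        "zip3": zip_code[:3] if len(zip_code) >= 3 else "",
    }


raw = {
    "initials": "rm",
    "last_name_code": "m460",
    "ssn": "123-45-6789",
    "dob": "1940-02-03",
    "sex": "m",
    "zip": "53703-1234",
}
print(comparison_fields(raw))
```

Missing inputs become empty strings, so a comparison step can treat them as missing rather than as disagreements.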


Part III: Probabilistic Matching of Inpatient Records to Mortality Records

This model was applied to three overlapping subsets of records, along with estimated corrections to statistical assumptions.

We merged the three linkage passes in a multiple imputation process that applies Markov chain Monte Carlo techniques to create five alternative sets of paired records.

– Identifies additional record pairs that have a low - but real - probability of being true matches, due to possible measurement errors.

For evaluation purposes, we de-duplicated these 5 sets to identify a final set of 36,562 inpatient-mortality records linked with probabilistic methods.
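
De-duplicating the five imputed sets amounts to taking their union. The Python sketch below uses invented pair IDs and also counts how many of the five sets contain each pair, which is one simple way to see which links are stable across imputations. It illustrates the evaluation step only, not the imputation itself.

```python
from collections import Counter

# Five hypothetical imputed sets of (patient_id, death_id) pairs.
imputed_sets = [
    {("P1", "D1"), ("P2", "D2")},
    {("P1", "D1"), ("P2", "D2"), ("P3", "D3")},
    {("P1", "D1"), ("P2", "D2")},
    {("P1", "D1"), ("P2", "D2")},
    {("P1", "D1"), ("P2", "D2"), ("P4", "D4")},
]

# Union of all sets: every pair that appears in at least one imputation.
unique_pairs = set().union(*imputed_sets)

# Number of imputations in which each pair appears (1 to 5).
support = Counter(pair for pairs in imputed_sets for pair in pairs)

print(len(unique_pairs))
print(sorted(pair for pair, n in support.items() if n < len(imputed_sets)))
```

Pairs present in only one or two sets correspond to the low-probability links that the imputation process adds.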


Comparison of Results

Combined Linked Pairs

How Linked?                           Number of       % of Deterministic   % of Probabilistic
                                      Linked Pairs    Matches              Matches
Both Methods                          31,367          93%                  86%
Same Death Record Matched to
  Different Patient Records           636             2                    2
Probabilistic Only                    4,559           --                   12
Deterministic Only                    1,673           5                    --
Total                                 38,235          100% (33,676)        100% (36,562)

93% of deterministic matches were confirmed by the probabilistic matches.

14% of probabilistic matches were not captured by deterministic linking.
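
A comparison like the one in this table can be computed by keying each method's output on the death record. The Python sketch below assumes each death record links to at most one patient per method; the sample IDs and the function name are invented. The last lines recompute the column totals from the counts reported in the table.

```python
def compare_linkages(deterministic, probabilistic):
    """Classify death records by how the two methods linked them.

    Each argument maps death_id -> patient_id for one method.
    """
    summary = {"both": 0, "different_patient": 0,
               "deterministic_only": 0, "probabilistic_only": 0}
    for death_id in deterministic.keys() | probabilistic.keys():
        if death_id in deterministic and death_id in probabilistic:
            same = deterministic[death_id] == probabilistic[death_id]
            summary["both" if same else "different_patient"] += 1
        elif death_id in deterministic:
            summary["deterministic_only"] += 1
        else:
            summary["probabilistic_only"] += 1
    return summary


det = {"D1": "P1", "D2": "P2", "D3": "P3"}
prob = {"D1": "P1", "D2": "P9", "D4": "P4"}
print(compare_linkages(det, prob))

# Column totals recomputed from the counts reported in the table.
slide = {"both": 31_367, "different_patient": 636,
         "probabilistic_only": 4_559, "deterministic_only": 1_673}
det_total = slide["both"] + slide["different_patient"] + slide["deterministic_only"]
prob_total = slide["both"] + slide["different_patient"] + slide["probabilistic_only"]
print(det_total, prob_total, sum(slide.values()))
```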


Comparison of Results

Evaluating the discrepant results:

High-probability matches not found in the deterministic matches. The most common issue was discrepancies in the last two ZIP digits.

Low-probability matches
• 2% of the record pairs identified by both methods were evaluated by LinkSolv as having a low probability of being a true match.
• This suggests that some deterministic criteria are weaker than would be desirable, notably last name encryption and SSN.

Deterministic matches not confirmed by probabilistic matching. Should we be wary of this 5% of matches?

• These are disproportionately in-hospital deaths.


Conclusions

De-duplicating patients

The strongest linking combination was patient’s initials + date of birth + sex + ZIP.
• Yielded reasonable and apparently robust results.

Given the observed instability of ZIP code in the population of deceased recent patients, we should experiment with substituting ZIP-3.

• This will result in fewer ‘patients’ being identified.

• The trade-off is the creation of more false-positive matches.


Conclusions

Linking patients to mortality records

The probabilistic process yields more matched pairs than the deterministic process, but not dramatically so. Overall, the more rigorous probabilistic method validated the results of the deterministic linkage.

Initials, date of birth, and sex
• Patient and mortality records generally reliable and consistent.

ZIP
• Less reliable: small moves often result in different ZIPs.
• Older patients particularly likely to make such moves.
• Probabilistic models used only ZIP-3.

SSN
• Using full SSN limited the success of exact matching.
• SSNs were teased out of policy numbers but are often missing or are a spouse’s SSN.
• Probabilistic models used only SSN-4.


Conclusions

Both methods created reasonable sets of matched pairs of records

Those sets had a high degree of common pairs.

The deterministic process is probably more accessible and efficient for the general user.

However, the quality is heavily dependent on the completeness and accuracy of the recorded data.


Conclusions

The probabilistic process, particularly as developed in LinkSolv, is more statistically rigorous and will more thoroughly identify matched pairs.

Using multiply-imputed output datasets requires sophisticated statistical treatment by well-trained researchers.

Useful lessons can be learned from the application of both methods to the same datasets. The probabilistic process provides a rigorous evaluation and, perhaps, validation of the results of deterministic exact-matching.

The probabilistic process provides insights into the utility of particular data elements; this may be used to refine and improve a deterministic matching process.


Acknowledgments

We gratefully acknowledge the support of CSTE’s Cardiovascular Disease Surveillance Data Pilot Project

We are indebted to Dr. Michael McGlincy, Strategic Matching Inc., for his thoughtful advice.


Linking Mortality and Inpatient Records: Comparing Deterministic and Probabilistic Methods

Richard Miller
[email protected]
608.267.3858

HerngLeh (Mike) Yuan
[email protected]
608.267.2487

Wisconsin Division of Public Health