Scott Van Heest Health Scientist, BA, MS, MPH, CPH

Feasibility of Deduplication of the De-identified NPCR Submitted Data

NAACCR 2018 Annual Conference June 14, 2018, Pittsburgh, PA

RELIABLE TRUSTED SCIENTIFIC DCPC

Brief Background of Duplicate Cases

• Duplicate cases make it difficult to assess the national completeness of case ascertainment

• Registries funded by the Centers for Disease Control and Prevention National Program of Cancer Registries (CDC/NPCR) are required to submit de-identified cancer incidence data files to CDC/NPCR annually

• Registries are required to remove duplicate records from the data files before submission

• Current processes do not search for duplicate records across CDC/NPCR cancer registry databases

• What if a patient was diagnosed or treated in more than one state?

Checking for Duplicates

• Analyze the feasibility of checking all de-identified CDC/NPCR annual submissions

• De-identified CDC/NPCR cancer files were condensed using NAACCR variables that could be used to distinguish unique cases

• These de-identified text files were then used in Link Plus and Match*Pro to identify potential duplicates

Visual Review of Record De-duplication

• De-duplication can be accomplished by visually comparing records and/or sorting cases using variables

• This approach becomes time-consuming, tedious, inefficient, and impractical as the number of records in the file increases

• Feasible to perform computerized record linkage between large files

• De-duplication is a fundamental requirement for accuracy and validity in any disease registry

• NPCR/NAACCR standard is to maintain ≤ 0.1% (≤ 1 per 1,000) duplicates

Quick Look at Deterministic Matching

• Computerized comparison where EVERYTHING needs to match EXACTLY:

• Often slight variations exist in the data between the two files for the same variables,

• Or variables are missing from one of the files:

• These variations would prevent a match from being identified

Data in examples does not include any NPCR data
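
As a minimal sketch of the exact-match behavior described on this slide, the Python snippet below compares two records field by field. The field names and record values are hypothetical (not NPCR data); the point is only that one spelling variation or one missing value is enough to block a deterministic match.

```python
def deterministic_match(rec_a, rec_b, fields):
    """Return True only if every listed field is present and identical in both records."""
    return all(
        rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
        for f in fields
    )

# Hypothetical records for illustration only (not NPCR data).
FIELDS = ["last_name", "first_name", "birth_date", "sex"]
file_1 = {"last_name": "Van Heest", "first_name": "Jane", "birth_date": "19700101", "sex": "2"}
same = dict(file_1)                           # identical copy
variant = dict(file_1, last_name="Vanheest")  # slight spelling variation
missing = dict(file_1, birth_date=None)       # variable missing from one file

exact_result = deterministic_match(file_1, same, FIELDS)
variant_result = deterministic_match(file_1, variant, FIELDS)
missing_result = deterministic_match(file_1, missing, FIELDS)
print(exact_result, variant_result, missing_result)  # True False False
```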

Probabilistic Matching

• Translates intuition into formal decision rules

• Recommended over traditional deterministic (exact matching) methods when:

• Data contain coding errors, reporting variations, missing data or duplicate records, abbreviations, or nicknames (Elizabeth – Betsey – Beth…)

• My favorite: Van Heest = Vanheest or Van or Heest

• Estimate probability/likelihood that two records are for the same person versus different people

Data in example does not include any NPCR data
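
One hedged way to see why name variants defeat exact comparison but not approximate comparison is sketched below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; this is an assumption made for illustration and is not claimed to be the string comparison that Link Plus or Match*Pro applies.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and keep letters only, so spacing and punctuation are ignored."""
    return "".join(ch for ch in name.lower() if ch.isalpha())

def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names (1.0 means identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

sim_joined = name_similarity("Van Heest", "Vanheest")  # identical once normalized
sim_partial = name_similarity("Van Heest", "Heest")    # partial surname
sim_other = name_similarity("Van Heest", "Smith")      # unrelated surname
print(sim_joined, round(sim_partial, 2), round(sim_other, 2))
```

A graded similarity like this lets a partial surname still contribute some evidence for a match, whereas an exact comparison would treat "Heest" and "Smith" as equally wrong.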

Probabilistic Matching

• Finds the records that seem to indicate duplicate records

• Calculates a score that indicates, for any pair of records, how likely it is that they both refer to the same case

• Sorts the likely and possible matched pairs in order of their scores

• Uses a defined threshold (cutoff value) for accepting or rejecting a potential link

• Discards unlikely matched pairs (scores below cutoff value)

• Gray area: range of scores above the cutoff value, considered uncertain matches

• Can set second cutoff value for certain matches, but not used due to multiple primaries

• Manually review the uncertain (gray) matches
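
The sort-and-threshold steps on this slide can be sketched as follows. The pair identifiers and scores are hypothetical, and the function name is illustrative only. The upper cutoff is optional and left unset here, mirroring the statement that a second cutoff was not used because of multiple primaries.

```python
def classify_pairs(scored_pairs, cutoff, upper_cutoff=None):
    """Sort pairs by descending score and split them at the cutoff value(s)."""
    ranked = sorted(scored_pairs, key=lambda p: p[1], reverse=True)
    result = {"certain": [], "review": [], "discarded": []}
    for pair_id, score in ranked:
        if score < cutoff:
            result["discarded"].append(pair_id)   # unlikely match, dropped
        elif upper_cutoff is not None and score >= upper_cutoff:
            result["certain"].append(pair_id)     # accepted without review
        else:
            result["review"].append(pair_id)      # gray area, manual review
    return result

# Hypothetical pair scores for illustration only.
pairs = [("A-B", 7.2), ("C-D", 18.4), ("E-F", 10.0), ("G-H", 12.9)]
buckets = classify_pairs(pairs, cutoff=10)
print(buckets)
```

With no upper cutoff, every pair at or above the cutoff lands in the manual-review bucket, in descending score order.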

Probabilistic Matching … a Bit More Detailed

• The total score for a linkage between any two records is the sum of the scores generated from matching individual fields

• The score assigned to a matching of individual fields is:

• The probability that the field agrees when a comparison pair is a match (M probability), similar to “sensitivity”: the ability of a test to correctly identify those with the disease (true positive rate)

• Reduced by the probability that the field agrees when a comparison pair is not a match (U probability), related to “specificity”: the ability of a test to correctly identify those without the disease (true negative rate); U itself corresponds to 1 − specificity

• Agreement argues for a match

• Disagreement argues against a match
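
A common way to turn M and U probabilities into field scores is the Fellegi-Sunter log-ratio weight, sketched below. Whether Link Plus or Match*Pro uses exactly this form (logarithm base, adjustments for partial agreement) is an assumption here, and the m and u values are hypothetical.

```python
import math

def field_weight(agrees, m, u):
    """Fellegi-Sunter style weight for one field comparison.

    m: P(field agrees | records are a true match)
    u: P(field agrees | records are not a match)
    """
    if agrees:
        return math.log2(m / u)           # positive: argues for a match
    return math.log2((1 - m) / (1 - u))   # negative: argues against a match

def total_score(comparisons, probabilities):
    """Sum the field weights over all compared fields."""
    return sum(
        field_weight(agrees, *probabilities[field])
        for field, agrees in comparisons.items()
    )

# Hypothetical (m, u) values for illustration only.
PROBS = {"birth_date": (0.95, 0.001), "sex": (0.98, 0.5)}
score_all_agree = total_score({"birth_date": True, "sex": True}, PROBS)
score_sex_differs = total_score({"birth_date": True, "sex": False}, PROBS)
print(round(score_all_agree, 2), round(score_sex_differs, 2))
```

Because chance agreement on birth date is far rarer than chance agreement on sex, agreement on birth date contributes a much larger weight in this sketch.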

Probabilistic Matching ….

• Agreement on an uncommon value is weighted more strongly for linkage than agreement on a common value

• Van Heest versus Smith

• Agreement on a more specific variable is weighted more strongly for linkage than agreement on a less specific one

• SSN versus Sex; Last name versus Middle Initial

• A weight is calculated for each field comparison

• A total weight (or “score”) is derived by summing the separate field comparisons across all fields being compared

• Probabilistic weights are:

• Field-specific – Birth date versus Sex

• Value-specific - “Jane” versus “Jennifer”
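
Value-specific weighting can be illustrated by using each value's relative frequency in the file as its chance-agreement (U) probability, a common simplification that is assumed here rather than taken from the presentation. The surname counts and the m value are hypothetical.

```python
import math
from collections import Counter

def value_specific_weights(values, m=0.95):
    """Agreement weight per value, using its relative frequency as the u probability."""
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(m / (count / total)) for value, count in counts.items()}

# Hypothetical surname list for illustration only.
surnames = ["Smith"] * 50 + ["Jones"] * 30 + ["Van Heest"] * 1
weights = value_specific_weights(surnames)
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(name, round(weight, 2))
```

Agreement on the rare surname earns a much larger weight than agreement on the common one, which is the "Van Heest versus Smith" point made above.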

CDC/NPCR - Link Plus Software

• Has been publicly available for over 15 years

• Stand-alone probabilistic record linkage program

• Combines ease of use and statistical sophistication

• Detects duplicates within a single database, or links 2 database files

• Supports fixed width files as well as delimited files

• Link Plus is free!

Link Plus Is Easy to Use

• Designed especially for cancer registry work

• HOWEVER, can be used with virtually any data

• Mathematics largely hidden from user

• Default values supplied for many tasks

• Familiar Windows interface

• Includes Help and examples

Match*Pro

• New linkage software developed by IMS for NCI/SEER, built upon the Link Plus application framework with improved functionality and usability

• Match*Pro was developed to automate the linkage requirements for the Virtual Pooled Registry Cancer Linkage System (VPR-CLS)

• The application will be distributed and supported independently on both NCI/SEER and CDC/NPCR websites

Blocking and Matching Variables Used

• Blocking

• Date of Birth

• Date of Diagnosis

• Histology Type

• Matching

• Date of Birth

• Date of Diagnosis

• Histology Type

• State of Diagnosis

• Sex

• Race 1

• Cutoff score of 10

• Overall sensitivity of 3
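
The blocking and matching configuration above can be sketched end to end as follows. This is an illustration under stated assumptions, not the Link Plus or Match*Pro implementation: the field names, the (m, u) probabilities, and the records are hypothetical; candidate pairs are assumed to be formed when records agree on at least one blocking variable; and the "overall sensitivity" setting is not modeled. Only the variable lists and the cutoff of 10 come from the slide.

```python
import math
from collections import defaultdict
from itertools import combinations

BLOCKING = ["birth_date", "dx_date", "histology"]
MATCHING = ["birth_date", "dx_date", "histology", "dx_state", "sex", "race1"]
# Hypothetical (m, u) probabilities; real values would be estimated from the data.
PROBS = {
    "birth_date": (0.97, 0.0001), "dx_date": (0.95, 0.001), "histology": (0.95, 0.02),
    "dx_state": (0.90, 0.05), "sex": (0.99, 0.5), "race1": (0.95, 0.4),
}

def candidate_pairs(records):
    """Pair up records that share a value on at least one blocking variable."""
    pairs = set()
    for field in BLOCKING:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            if rec.get(field):
                blocks[rec[field]].append(idx)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs

def score(rec_a, rec_b):
    """Sum log-ratio agreement/disagreement weights over the matching variables."""
    total = 0.0
    for field in MATCHING:
        m, u = PROBS[field]
        agrees = rec_a.get(field) is not None and rec_a.get(field) == rec_b.get(field)
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total

def potential_duplicates(records, cutoff=10):
    """Score only the blocked pairs and keep those at or above the cutoff."""
    scored = [(i, j, score(records[i], records[j])) for i, j in candidate_pairs(records)]
    return sorted((p for p in scored if p[2] >= cutoff), key=lambda p: p[2], reverse=True)

# Hypothetical de-identified records for illustration only (not NPCR data).
records = [
    {"birth_date": "19500102", "dx_date": "20100315", "histology": "8140",
     "dx_state": "GA", "sex": "1", "race1": "01"},
    {"birth_date": "19500102", "dx_date": "20100315", "histology": "8140",
     "dx_state": "FL", "sex": "1", "race1": "01"},
    {"birth_date": "19620707", "dx_date": "20100315", "histology": "8070",
     "dx_state": "GA", "sex": "2", "race1": "01"},
]
dupes = potential_duplicates(records)
print([(i, j, round(s, 1)) for i, j, s in dupes])
```

Blocking keeps the number of scored pairs manageable: only records that already share a blocking value are compared, instead of every record against every other record.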

NPCR National Case Counts

2015 reporting represents 12-month rather than 24-month data

NPCR Potential Pairs Identified

Potential pairs have multiple primaries included

[Figure: NPCR potential pairs identified per year, 2000–2015; vertical axis from 24,000 to 35,000]

Potential Duplicates as a Percentage of NPCR Case Counts

What Do the Results Indicate?

• That there are around 1.9% potential duplicates

• Ranges from around 1.8% to slightly over 2%

• Biggest difference across annual submissions is about 0.19 percentage points

• This is what we hoped to see: consistency over time

• That the total count for potential duplicates is rising at about the same rate nationally as the NPCR case counts increase

• There were no spikes and little, if any, trend in potential duplicates as a percentage of case counts
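
The arithmetic behind these percentages is simple and can be sketched as below. The counts are made up purely to show the calculation; they are NOT the NPCR figures, which appear only in the charts.

```python
def duplicate_percentages(case_counts, potential_pairs):
    """Potential duplicate pairs as a percentage of case counts, per year."""
    return {
        year: 100.0 * potential_pairs[year] / case_counts[year]
        for year in case_counts
    }

# Made-up counts for illustration only; these are NOT the NPCR figures.
cases = {2013: 1_000_000, 2014: 1_020_000, 2015: 1_050_000}
pairs = {2013: 18_500, 2014: 19_380, 2015: 21_000}

pct = duplicate_percentages(cases, pairs)
spread = max(pct.values()) - min(pct.values())  # largest gap, in percentage points
print({year: round(value, 2) for year, value in pct.items()}, round(spread, 2))
```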

What the Results Don’t Show

• These are potential duplicates, not confirmed duplicates

• Why?

• Multiple primaries are likely the primary reason for potential duplicates

• The multiple primary rules guide and standardize the process of determining the number of primaries

• https://seer.cancer.gov/tools/mphrules/

• Other potential duplicates of cases were identified using de-identified data

• Further investigation may determine that the matched variables belonged to two different individuals

• Other explanations… this is preliminary research!

Duplicate Assessment on the National Data

• Pros:

• Convenient

• Requires a rather minimal amount of skill to use

• Confirms that national data that are reported are consistent

• Does not require more than the de-identified submitted data

• Cons:

• Not very informative if there are actually more duplicates

• Confirming additional duplicates would be labor-intensive

• Would require coordination with potentially many other registries

• Would require tools to identify multiple primaries

Conclusion

• May be something to pursue in the future

• Support for the development of additional tools

www.cdc.gov/uscs

Scott M. Van Heest, MPH, MS, CPH

770-488-4863

For more information, please contact Centers for Disease Control and Prevention

1600 Clifton Road NE, Atlanta, GA 30333

Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348

Visit: www.cdc.gov | Contact CDC at: 1-800-CDC-INFO or www.cdc.gov/info

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.


Go to the official federal source of cancer prevention information: www.cdc.gov/cancer

@CDC_Cancer

Follow DCPC Online!