TRANSCRIPT
Scott Van Heest Health Scientist, BA, MS, MPH, CPH
Feasibility of Deduplication of the De-identified NPCR Submitted Data
NAACCR 2018 Annual Conference June 14, 2018, Pittsburgh, PA
RELIABLE TRUSTED SCIENTIFIC DCPC
Brief Background of Duplicate Cases
• Duplicate cases make it difficult to assess the national completeness of case ascertainment
• Registries funded by the Centers for Disease Control and Prevention National Program of Cancer Registries (CDC/NPCR) are required to submit de-identified cancer incidence data files to CDC/NPCR annually
• Registries are required to remove duplicate records from the data files before submission
• Current processes do not search for duplicate records across CDC/NPCR cancer registry databases
• What if a patient was diagnosed or treated in more than one state?
Checking for Duplicates
• Analyze the feasibility of checking all de-identified CDC/NPCR annual submissions
• De-identified CDC/NPCR cancer files were condensed using NAACCR variables that could be used to distinguish unique cases
• These de-identified text files were then used in Link Plus and Match*Pro to identify potential duplicates
Visual Review of Record De-duplication
• De-duplication can be accomplished by visually comparing records and/or sorting cases using variables
• This approach becomes time-consuming, tedious, inefficient, and impractical as the number of records in the file increases
• Feasible to perform computerized record linkage between large files
• Fundamental requirement for accuracy and validity in any disease registry
• NPCR/NAACCR standard is to maintain ≤ 0.1% (≤ 1 per 1,000) duplicates
Quick Look at Deterministic Matching
• Computerized comparison where EVERYTHING needs to match EXACTLY:
• Often slight variations exist in the data between the two files for the same variables,
• Or variables are missing from one of the files:
• These variations would prevent a match from being identified
Data in examples does not include any NPCR data
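The limitation described above can be illustrated with a minimal sketch. The record layout, field names, and the helper `deterministic_match` are hypothetical and exist only for this illustration; the records are made up and contain no NPCR data.

```python
def deterministic_match(rec_a, rec_b, fields):
    """Return True only if every compared field is present and identical in both records."""
    return all(
        rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
        for f in fields
    )

# Made-up records for illustration only (not NPCR data).
FIELDS = ["last_name", "birth_date", "sex"]
rec_1 = {"last_name": "Van Heest", "birth_date": "19600102", "sex": "1"}
rec_2 = {"last_name": "Vanheest", "birth_date": "19600102", "sex": "1"}  # spelling variation
rec_3 = {"last_name": "Van Heest", "birth_date": None, "sex": "1"}       # missing birth date

same = deterministic_match(rec_1, dict(rec_1), FIELDS)   # identical copy
variant = deterministic_match(rec_1, rec_2, FIELDS)      # one field spelled differently
missing = deterministic_match(rec_1, rec_3, FIELDS)      # one field missing
print(same, variant, missing)
```

Only the identical copy is accepted; the spelling variation and the missing birth date each cause the pair to be rejected, even though all three records plausibly describe the same person.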
Probabilistic Matching
• Translates intuition into formal decision rules
• Recommended over traditional deterministic (exact matching) methods when:
• The data contain coding errors, reporting variations, missing data or duplicate records, abbreviations, or nicknames (Elizabeth, Betsey, Beth…)
• My favorite: Van Heest = Vanheest or Van or Heest
• Estimate probability/likelihood that two records are for the same person versus different people
Data in example does not include any NPCR data
Probabilistic Matching
• Finds the records that seem to indicate duplicate records
• Calculates a score that indicates, for any pair of records, how likely it is that they both refer to the same case
• Sorts the likely and possible matched pairs in order of their scores
• Uses a defined threshold (cutoff value) for accepting or rejecting a potential link
• Discards unlikely matched pairs (scores below cutoff value)
• Gray area: range of scores above the cutoff value, considered uncertain matches
• Can set a second, higher cutoff value above which pairs are treated as certain matches, but this was not used because of multiple primaries
• Manually review the uncertain (gray) matches
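The workflow above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Link Plus or Match*Pro is implemented: the function `classify_pairs`, the tuple layout `(id_a, id_b, score)`, the scores, and the example cutoff of 10 are all hypothetical.

```python
def classify_pairs(scored_pairs, cutoff, upper_cutoff=None):
    """Sort scored pairs by score and split them into certain / uncertain / discarded."""
    ranked = sorted(scored_pairs, key=lambda p: p[2], reverse=True)
    result = {"certain": [], "uncertain": [], "discarded": []}
    for pair in ranked:
        score = pair[2]
        if score < cutoff:
            result["discarded"].append(pair)   # unlikely match, dropped
        elif upper_cutoff is not None and score >= upper_cutoff:
            result["certain"].append(pair)     # only used if a second cutoff is set
        else:
            result["uncertain"].append(pair)   # gray area, goes to manual review
    return result

# Made-up pair scores for illustration only.
pairs = [("A1", "B7", 22.4), ("A2", "B3", 4.1), ("A5", "B9", 12.8)]
review = classify_pairs(pairs, cutoff=10)
print(review["uncertain"])
```

With no second cutoff, every pair scoring at or above the cutoff lands in the gray area for manual review, which mirrors the choice described above for data that include multiple primaries.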
Probabilistic Matching … a Bit More Detailed
• The total score for a linkage between any two records is the sum of the scores generated from matching individual fields
• The score assigned to a matching of individual fields is:
• The probability that the field agrees when the comparison pair is a true match (M Probability), similar to “sensitivity”: the ability of a test to correctly identify those with the disease (true positive rate)
• Reduced by the probability that the field agrees when the comparison pair is not a match (U Probability), related to “specificity”: the ability of a test to correctly identify those without the disease (true negative rate)
• Agreement argues for a match
• Disagreement argues against a match
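One common way to turn M and U probabilities into field scores is the log-ratio weighting from the Fellegi-Sunter record linkage model. The slides do not state the formula, so the sketch below assumes that formulation; the function names and the M and U values are hypothetical.

```python
import math

def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (usually negative) when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

def total_score(comparisons, probs):
    """Sum field weights. comparisons: field -> bool (agrees?); probs: field -> (m, u)."""
    score = 0.0
    for field, agrees in comparisons.items():
        m, u = probs[field]
        score += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return score

# Made-up M and U probabilities for illustration only.
PROBS = {"birth_date": (0.95, 0.001), "sex": (0.98, 0.5)}

score_agree = total_score({"birth_date": True, "sex": True}, PROBS)
score_mixed = total_score({"birth_date": False, "sex": True}, PROBS)
print(round(score_agree, 2), round(score_mixed, 2))
```

Agreement on birth date contributes far more than agreement on sex because two unrelated records rarely share a birth date by chance (small U), while they share a sex about half the time.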
Probabilistic Matching …
• Agreement on an uncommon value is weighted more for linkage than agreement on a common value
• Van Heest versus Smith
• Agreement on a more specific variable is weighted more strongly for linkage than agreement on a less specific one
• SSN versus Sex; Last name versus Middle Initial
• A weight is calculated for each field comparison
• A total weight (or “score”) is derived by summing the separate field comparisons across all fields being compared
• Probabilistic weights are:
• Field-specific – Birth date versus Sex
• Value-specific - “Jane” versus “Jennifer”
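Value-specific weighting can be sketched by approximating the U probability of each value with its relative frequency in the file, so that rare values earn larger agreement weights. This approximation, the fixed M probability of 0.95, and the function `value_specific_weights` are assumptions made for illustration; the names and counts are made up.

```python
import math
from collections import Counter

def value_specific_weights(values, m=0.95):
    """Agreement weight per value, using the value's relative frequency as its U probability."""
    counts = Counter(values)
    total = len(values)
    return {v: math.log2(m / (c / total)) for v, c in counts.items()}

# Made-up last-name distribution for illustration only.
last_names = ["Smith"] * 50 + ["Jones"] * 30 + ["Lee"] * 19 + ["Van Heest"] * 1
weights = value_specific_weights(last_names)

for name in ("Smith", "Van Heest"):
    print(name, round(weights[name], 2))
```

Two records agreeing on "Van Heest" receive a much larger weight than two records agreeing on "Smith", because chance agreement on the rare name is far less likely.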
CDC/NPCR - Link Plus Software
• Has been publicly available for over 15 years
• Stand-alone probabilistic record linkage program
• Combines ease of use and statistical sophistication
• Detects duplicates within a single database, or links 2 database files
• Supports fixed width files as well as delimited files
• Link Plus is free!
Link Plus Is Easy to Use
• Designed especially for cancer registry work
• HOWEVER, can be used with virtually any data
• Mathematics largely hidden from user
• Default values supplied for many tasks
• Familiar Windows interface
• Includes Help and examples
Match*Pro
• New linkage software developed by IMS for NCI/SEER, built upon the Link Plus application framework with improved functionality and usability
• Match*Pro was developed to automate the linkage requirements for the Virtual Pooled Registry Cancer Linkage System (VPR-CLS)
• The application will be distributed and supported independently on both NCI/SEER and CDC/NPCR websites
Blocking and Matching Variables Used
• Blocking
• Date of Birth
• Date of Diagnosis
• Histology Type
• Matching
• Date of Birth
• Date of Diagnosis
• Histology Type
• State of Diagnosis
• Sex
• Race 1
• Cutoff score of 10
• Overall sensitivity of 3
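Blocking limits the comparisons to record pairs that already share a key value, which is what makes linkage on large files tractable. The sketch below assumes that a pair becomes a candidate when it agrees on at least one blocking variable; whether Link Plus or Match*Pro combines blocking variables this way is an assumption here, and the field names and records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

BLOCKING_VARS = ["birth_date", "dx_date", "histology"]

def candidate_pairs(records, blocking_vars=BLOCKING_VARS):
    """Yield index pairs of records that agree on at least one blocking variable."""
    seen = set()
    for var in blocking_vars:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            value = rec.get(var)
            if value:  # records with a missing value are not placed in any block
                blocks[value].append(idx)
        for members in blocks.values():
            for pair in combinations(members, 2):
                if pair not in seen:  # avoid emitting the same pair twice
                    seen.add(pair)
                    yield pair

# Made-up de-identified records for illustration only (not NPCR data).
records = [
    {"birth_date": "19500101", "dx_date": "20100315", "histology": "8140", "state": "GA"},
    {"birth_date": "19500101", "dx_date": "20100315", "histology": "8140", "state": "FL"},
    {"birth_date": "19720630", "dx_date": "20110901", "histology": "8500", "state": "PA"},
    {"birth_date": "19801212", "dx_date": "20120520", "histology": "8500", "state": "OH"},
]
pairs = sorted(candidate_pairs(records))
print(pairs)
```

Only the candidate pairs would then be scored on the matching variables and compared against the cutoff; all other pairs are never examined.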
NPCR National Case Counts
2015 reporting represents 12-month rather than 24-month data
NPCR Potential Pairs Identified
Potential pairs have multiple primaries included
[Chart: NPCR Potential Pairs Identified, by year, 2000 to 2015; vertical axis from 24,000 to 35,000]
What Do the Results Indicate?
• That there are around 1.9 percent potential duplicates
• Ranges from around 1.8% to slightly over 2%
• Biggest difference across annual submissions is about 0.19%
• This is what we hoped to see: consistency over time
• That the total count for potential duplicates is rising at about the same rate nationally as the NPCR case counts increase
• There were no spikes and little if any trends for the percentage of potential duplicates as a percentage of case counts
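The percentages above are potential pairs divided by case counts for each year. A minimal sketch of that arithmetic follows; the counts are deliberately small, made-up numbers (not NPCR figures), and the function name is hypothetical.

```python
def potential_duplicate_percent(potential_pairs, case_count):
    """Potential duplicate pairs as a percentage of the case count."""
    return 100.0 * potential_pairs / case_count

# Made-up (potential pairs, case count) per year, for illustration only.
yearly = {
    "year_1": (190, 10_000),
    "year_2": (184, 10_000),
    "year_3": (201, 10_000),
}

percents = {
    year: potential_duplicate_percent(pairs, cases)
    for year, (pairs, cases) in yearly.items()
}
spread = max(percents.values()) - min(percents.values())
print(percents, round(spread, 2))
```

A small spread between the highest and lowest yearly percentage is the kind of consistency over time described above.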
What the Results Don’t Show
• These are potential duplicates, not confirmed duplicates
• Why?
• Multiple primaries are likely the primary reason for potential duplicates
• The multiple primary rules guide and standardize the process of determining the number of primaries
• https://seer.cancer.gov/tools/mphrules/
• Other potential duplicates of cases were identified using de-identified data
• Further investigation may determine that the variables that matched belonged to two different individuals
• Other explanations… this is preliminary research!
Duplicate Assessment on the National Data
• Pros:
• Convenient
• Rather minimal amount of skill to use
• Confirms that national data that are reported are consistent
• Does not require more than the de-identified submitted data
• Cons:
• Not very informative if there are actually more duplicates
• Confirming additional duplicates would be labor-intensive
• Would require coordination with potentially many other registries
• Would require tools to identify multiple primaries
Conclusion
• May be something to pursue in the future
• Support for the development of additional tools
Scott M. Van Heest, MPH, MS, CPH
[email protected] 770-488-4863
For more information, please contact Centers for Disease Control and Prevention
1600 Clifton Road NE, Atlanta, GA 30333
Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348
Visit: www.cdc.gov | Contact CDC at: 1-800-CDC-INFO or www.cdc.gov/info
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Go to the official federal source of cancer prevention information: www.cdc.gov/cancer
@CDC_Cancer
Follow DCPC Online!