day1d dealingwithduplicates - amchp · 27/11/2011 · sas chooses the first one it finds based on...
11/27/2011
Derek Chapman, PhD
December 2011
Data Linkage Techniques: Tricks of the Trade
General data cleaning issue
Linkage can create more duplicates
  ◦ Easier to deal with before linkage
Accurate counts are important to evaluate linkage success

Records defined by 1 or more “key” variables
Can be existing variable (e.g., SSN) or generated (e.g., sequential number)
Unique key should not contain missing or NULL values
Unique key should uniquely identify all possible rows in a table, not just the current data
Birth Data (Infants)
  ◦ Certificate number AND year
Birth Data (Maternal Birth History)
  ◦ Mom SSN AND Date of Delivery AND Birth Order
Hospital Discharge Data
  ◦ SSN AND Date of Discharge
Birth Defects Registry
  ◦ Child Name AND Child DOB AND ICD-9-CM Code
Newborn Hearing Screening
  ◦ Child Name AND Child DOB AND Date of Screening
Prior to linkage, you will need to identify what types of data are available for matching between the datasets (e.g., mom, dad, infant)
  ◦ This can be (and usually is) different from the variable(s) that define the unique record in the table
After performing data cleaning/standardization, create a de-duplicated linking table containing only the variables needed for matching
Infant Linkage

Births            Birth Defects
Child first name  Child first name
Child Surname     Child Surname
Child DOB         Child DOB
Mom first name    Discharge Date
Mom maiden        ICD-9-CM Code
Mom last name
Mom DOB
Mom SSN
Dad 1st name
Dad Surname
Dad DOB
Dad SSN

Maternal Linkage

Births            Family Planning
Child first name
Child Surname
Child DOB         Date Enrolled
Mom first name    Mom first name
Mom Maiden
Mom last name     Mom Surname
Mom DOB           Mom DOB
Mom SSN
Dad 1st name
Dad Surname
Dad DOB
Dad SSN
Exact duplicates
Variation in key fields (same person)
  ◦ Data entry errors
  ◦ Value changes
Multiple births (same data except first/middle name)
Longitudinal data
Cert Num  Child Fname  Child Lname  Child DOB   Mom Fname  Mom Maiden  Mom DOB     Dad Fname  Dad Lname  Dad DOB
00001     Mohamed      Kruse        4/12/2008   Kayla      White       7/4/1972    Rex        Kruse      12/15/1968
00002     Alexis       Lockett      10/31/2001  Mary       Lockett     9/20/1973   John       Lockett    12/23/1978
00003     Luke         Skywalker    7/4/2005    Padmé      Amidala     10/31/1975  Anakin     Skywalker  2/14/1980
00004     Leia         Skywalker    7/4/2005    Padmé      Amidala     10/31/1975  Anakin     Skywalker  2/14/1980
00005     Johanna      Trotter      2/14/2007   Susan      Jackson     8/18/1979   Zach       Trotter    11/17/1965
00006     Keith        Ward         11/23/2006  Teri       Sperry      3/14/1985   Kobe       Ward       3/15/1972
00007     Keith        Ward         11/23/2006  Teri       Sperry      3/14/1985   Kobe       Ward       3/15/1972
00008     Joan         Jett         8/14/2004   Caroline   Jett        1/25/1982   LeBron     Jett       3/2/1973
00008     Joan         Jett         8/14/2004   Caroline   Jett        1/25/1982   LeBron     Jett       3/2/1973
first_name  last_name    DOB         sex  race  weight  Birth Defect Code
Alexis      ROMANO       12/12/2002  0    3     2433    223
Andrea      HOLLIS       10/3/1999   0    1     2800    270
Andrea      HOLLIS       10/3/1999   0    1     1544    380
Kayla       CHRISTENSON              0    2     1754    193
Jordan      SOUZA        4/27/1996   0    4     2200    333
Dean        PADILLA      9/2/1999    1    3     2675    323
Johnathan   BLANCO       3/3/2011    1    3     1243    484
Kyla        CALDERON     8/29/2001   1    2     2589    303
Jaiden      FORTUNE      10/28/1993  0    3     1392    407
Shortest method in terms of lines of code

  PROC SORT data = library.dataset
            OUT = library.tablename
            /* SAS options: */ [nodup|nodupkey];
    BY variable(s);
  run;

OUT = creates an output dataset
nodup removes exact duplicates (all vars same)
nodupkey removes duplicates based only on key variables (in the BY statement)
An example:

  libname MCH 'c:\AMCHP\TrainingData';

  PROC SORT data = MCH.Enrollment
            OUT = MCH.Enroll_NoDups
            nodupkey;
    BY Fname Lname DOB;   /* BY variables are separated by blanks, not commas */
  run;
Fname      Lname   DOB        RaceCode  nodup  nodupkey
Penelope   MAYO    2/23/2011  1         Keep   Keep
Penelope   MAYO    2/23/2011            Keep   Drop
Christian  MEYER   1/13/2010  2         Keep   Keep
Christian  MEYER   1/13/2010  1         Keep   Drop
Brandon    LOUIS   2/14/2011  3         Keep   Keep
Hazel      CARNEY  7/29/2007            Keep   Keep
Ava        ZIMMER  7/22/2007  2         Keep   Keep
Ava        ZIMMER  7/22/2007  2         Drop   Drop
SAS chooses the first one it finds
  ◦ Based on the previous sort order
  ◦ Not usually an issue with preparation of linking files, since you are only using key fields
Sort data (regular proc sort with no options) BEFORE you de-duplicate to ensure that the correct row of data is selected
  ◦ e.g. “sort BY descending screen_date” so the most recent hearing loss status will be selected for each child
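The "sort first, then de-duplicate" advice above can be sketched as two PROC SORT steps. This is only an illustration: the dataset MCH.Hearing and its variables (fname, lname, dob, screen_date) are hypothetical names, not part of the training data shown in these slides.

```sas
/* Hypothetical dataset: MCH.Hearing with fname, lname, dob, screen_date */

/* Step 1: regular sort so the most recent screening comes first per child */
proc sort data = MCH.Hearing out = work.hearing_sorted;
  by fname lname dob descending screen_date;
run;

/* Step 2: de-duplicate on the key only. EQUALS (the default) keeps the
   incoming order within each BY group, so the first row per child,
   i.e. the most recent screening, is the one that is kept. */
proc sort data = work.hearing_sorted out = work.hearing_latest nodupkey equals;
  by fname lname dob;
run;
```

Writing to a new OUT= dataset in both steps leaves the source table untouched, which makes it easier to compare record counts before and after.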
1. Run a PROC SORT without the NODUPKEY option
  ◦ The BY statement should have the key variables that uniquely identify a row (and an optional variable to tell it which record to choose)
2. Create a single KeyID if you have multiple key variables (use concatenation functions)
3. Run a Data Step that selects the first or last sorted KeyID in the group
var1 || var2 || …varx
CAT (var1, var2…varx)
  ◦ Concatenates as if the || was used
CATT (var1, var2…varx)
  ◦ Removes trailing blanks from each var
CATS (var1, var2…varx)
  ◦ Removes leading and trailing blanks from each var
CATX ('delimiter', var1, var2…varx)
  ◦ Same as CATS but adds a delimiter
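To see how the concatenation functions above differ, a small DATA _NULL_ step can write each result to the log. This is a sketch with made-up values; the variable names key_cat, key_catt, key_cats and key_catx are chosen only for illustration.

```sas
data _null_;
  length lname fname $10;
  lname = 'Ward';
  fname = ' Keith';          /* note the leading blank */
  dob   = '23NOV2006'd;      /* numeric SAS date value */

  key_cat  = cat(lname, fname);    /* keeps all blanks, like lname || fname */
  key_catt = catt(lname, fname);   /* trailing blanks removed from each value */
  key_cats = cats(lname, fname);   /* leading and trailing blanks removed */
  /* CATX strips blanks like CATS and inserts the delimiter between values;
     PUT() turns the date into readable text before concatenating */
  key_catx = catx('|', lname, fname, put(dob, yymmdd10.));

  put key_cat= / key_catt= / key_cats= / key_catx=;
run;
```

One reason to consider CATX with a delimiter for a key: without a separator, different value pairs can collapse to the same string (CATS of 'AB' and 'C' and CATS of 'A' and 'BC' both give 'ABC').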
Create new concatenated key variable if needed

  data new;
    set old;
    newkeyID = CATS (lname, fname, dob);
  run;

Sort the data by the key variable and an optional sorting variable (e.g., encounter_date)
  ◦ Use descending before vars in BY statement to ensure nonmissing data or most recent dates will be selected as the “first one”

  proc sort data = library.dataset;
    BY newKeyID optionalvar;
  run;
Use the Data Step to find duplicates (or unique records, or both)

  data library.dataunique
       library.duplicates;
    set library.SourceDataset;
    BY newKeyID;
    /* keep the first record per key; use last.newKeyID to keep the last one instead */
    if first.newKeyID = 1 then output library.dataunique;
    else output library.duplicates;
  run;
  libname MCH 'c:\AMCHP\TrainingData';

  data MCH.Enrollment2;
    set MCH.enrollment;
    keyID = CATS (lname, fname, dob);
  run;

  proc sort data = MCH.enrollment2;
    by KeyID descending EnrollDate;
  run;

  data MCH.keep MCH.dups;
    set MCH.enrollment2;
    BY KeyID;
    if first.KeyID=1 then output MCH.keep;
    else output MCH.dups;
  run;
SQL = Structured Query Language
PROC SQL turns on SQL and it remains on until a QUIT statement is issued
  ◦ You do not need to submit a run; statement
There is no need to sort prior to merges
Do NOT need common variable names for key variables when merging
Semicolons are only needed at the end of the statement, not after each “clause”
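The last three points can be illustrated with a PROC SQL join. This is a minimal sketch: the tables work.births and work.defects and the columns CertNum, BirthCert and ICD9Code are hypothetical names used only to show the shape of the statement.

```sas
proc sql;
  /* Hypothetical tables: work.births is keyed by CertNum,
     work.defects is keyed by BirthCert. Neither is sorted first. */
  create table work.births_defects as
  select b.*, d.ICD9Code
  from work.births as b
       left join work.defects as d
       on b.CertNum = d.BirthCert   /* key names differ; no rename needed */
  ;                                 /* one semicolon ends the whole statement */
quit;
```

The left join keeps every row of work.births and adds the defect code where a match exists.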
CREATE TABLE library.newtablename AS
  ◦ Sends output to a new table (like data newtable;)
SELECT keyvar1, keyvar2, …keyvarX FROM library.oldtablename
  ◦ Identifies source table (like set oldtable;) and variables to be output (like keep var1 var2;)
  ◦ For non-key variables, you must pick one value per group using the MIN or MAX function (e.g., the earliest or latest date) and name the result with AS
      SELECT fname, lname, dob, MAX (screen_date) AS RecentScreen
GROUP BY keyvar1, keyvar2, …keyvarX
  ◦ Sorts and groups by key variables
  proc sql;
    create table MCH.sqlunique as
    select fname, lname, dob,
           max(racecode) as race,
           max(screen_date) as RecentScreen
    from MCH.enrollment
    group by lname, fname, dob;
  quit;
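Since accurate counts matter for evaluating a linkage, it can help to list which keys are duplicated before collapsing them. The following is a sketch using the same MCH.enrollment table; the output table work.dup_keys and the column n_records are names made up for this example.

```sas
proc sql;
  /* One row per key that occurs more than once, with its record count */
  create table work.dup_keys as
  select lname, fname, dob, count(*) as n_records
  from MCH.enrollment
  group by lname, fname, dob
  having count(*) > 1;
quit;
```

Keep in mind that MAX and MIN are evaluated separately for each column, so in a GROUP BY query the selected values for two non-key variables may come from different source rows. If all values must come from the same record, the sort-and-Data-Step approach is the safer choice.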
Deterministic linkages
  ◦ Linkage steps
  ◦ Creating subsets of data (e.g., identifying remaining unlinked data after a linkage pass has taken place)
Adding in analysis variables after linkage is complete
[Diagram labels: A, B, A+B]
Goal of merging is to make dataset WIDER
Identify which variable(s) the datasets have in common
PROC SORT each dataset by common variable(s)
If the merging variables do not have the same name, you must rename them in the data step
Most often you want to tag the dataset from which you want to keep all records
Think of this as the “master” dataset
Do this by using in=<tag>
  ◦ Assigns a tag, such as a letter or word, to records in the dataset you want as your “master”
Use the condition if <tag> to keep all records marked with the tag
Dataset #1 has the following records

ID#  Name  Code
A    Mary  1
B    Sue   2
C    Pam   3
D    Ann   5

Dataset #2 has definitions for the codes

Code  Definition
1     Medicaid
2     CHIP
3     Private Insurance
4     Tricare
5     No Insurance

  data together;
    merge dataset1 dataset2;
    by code;
  run;
ID  Name  Code  Definition
A   Mary  1     Medicaid
B   Sue   2     CHIP
C   Pam   3     Private Insurance
          4     Tricare
D   Ann   5     No Insurance

Result: a dataset with a blank record from dataset 1, which had no corresponding record for code 4
  data together;
    merge dataset1 (in=a) dataset2;
    BY code;
    if a;
  run;

The in=a tags each record in dataset1
The if a; will keep all records from dataset1 and only those that match from dataset2
ID#  Name  Code  <TAG>
A    Mary  1     a
B    Sue   2     a
C    Pam   3     a
D    Ann   5     a

ID#  Name  Code  Definition         <TAG>
A    Mary  1     Medicaid           a
B    Sue   2     CHIP               a
C    Pam   3     Private Insurance  a
           4     Tricare
D    Ann   5     No Insurance       a

SAS will exclude any record without an “a”

Your final “together” dataset will look like this:

ID#  Name  Code  Definition
A    Mary  1     Medicaid
B    Sue   2     CHIP
C    Pam   3     Private Insurance
D    Ann   5     No Insurance

NOTE: the in= tag can be any character or word you want (it must be a valid SAS variable name).
  proc sort data = dataset1;
    by code;
  run;
  proc sort data = dataset2;
    by code;
  run;
  data together;
    merge dataset1 (in=a) dataset2 (in=b);
    by code;
    if b;
  run;
  proc sort data = dataset1;
    by BirthCert;
  run;
  proc sort data = dataset2;
    by CertNum;
  run;
  data together;
    /* rename BirthCert so both datasets share the key name CertNum */
    merge dataset1 (in=g rename=(BirthCert=CertNum)) dataset2;
    by CertNum;
    if g;
  run;
A+B
  data linked;
    merge moms(in=A) births(in=B);
    by momID;
    if A and B;
  run;

Only keeps records where key fields match in both tables

  data linked;
    merge moms(in=A) births(in=B);
    by momID;
    if A;
  run;

Keeps all records in dataset A and any matching data in dataset B

  data linked;
    merge moms(in=A) births(in=B);
    by momID;
    if B;
  run;

Keeps all records in dataset B and any matching data in dataset A
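The three variations above can also be combined in a single Data Step that writes linked and unlinked records to separate datasets, which is one way to identify what remains after a linkage pass. This is a sketch: the output names unlinked_moms and unlinked_births are made up for illustration, and both inputs are assumed to be sorted by momID already.

```sas
/* Assumes moms and births are already sorted by momID */
data linked unlinked_moms unlinked_births;
  merge moms(in=A) births(in=B);
  by momID;
  if A and B then output linked;       /* key found in both tables */
  else if A then output unlinked_moms; /* only in moms */
  else output unlinked_births;         /* only in births */
run;
```

The record counts of the three output datasets can then be compared against the input counts to evaluate the linkage pass.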