day1d dealingwithduplicates - amchp · 27/11/2011 · sas chooses the first one it finds based on...

11/27/2011


Derek Chapman, PhD
December 2011

Data Linkage Techniques: Tricks of the Trade

General data cleaning issue

Linkage can create more duplicates
◦ Easier to deal with before linkage

Accurate counts are important to evaluate linkage success

Records defined by 1 or more “key” variables

Can be existing variable (e.g., SSN) or generated (e.g., sequential number)

Unique key should not contain missing or NULL values

Unique key should uniquely identify all possible rows in a table, not just the current data


Birth Data (Infants)
◦ Certificate number AND year

Birth Data (Maternal Birth History)
◦ Mom SSN AND Date of Delivery AND Birth Order

Hospital Discharge Data
◦ SSN AND Date of Discharge

Birth Defects Registry
◦ Child Name AND Child DOB AND ICD-9-CM Code

Newborn Hearing Screening
◦ Child Name AND Child DOB AND Date of Screening

Prior to linkage, you will need to identify what types of data are available for matching between the datasets (e.g., mom, dad, infant)
◦ This can be (and usually is) different from the variable(s) that define the unique record in the table

After performing data cleaning/standardization, create a de-duplicated linking table containing only the variables needed for matching

Infant Linkage

Births            Birth Defects
Child first name  Child first name
Child Surname     Child Surname
Child DOB         Child DOB
Mom first name    Discharge Date
Mom maiden        ICD-9-CM Code
Mom last name     Mom DOB
Mom SSN
Dad 1st name
Dad Surname
Dad DOB
Dad SSN

Maternal Linkage

Births            Family Planning
Child first name
Child Surname
Child DOB         Date Enrolled
Mom first name    Mom first name
Mom Maiden
Mom last name     Mom Surname
Mom DOB           Mom DOB
Mom SSN
Dad 1st name
Dad Surname
Dad DOB
Dad SSN


Exact duplicates

Variation in key fields (same person)
◦ Data entry errors
◦ Value changes

Multiple births (same data except first/middle name)

Longitudinal data

Cert   Child    Child      Child       Mom       Mom      Mom         Dad     Dad        Dad
Num    Fname    Lname      DOB         Fname     Maiden   DOB         Fname   Lname      DOB
00001  Mohamed  Kruse      4/12/2008   Kayla     White    7/4/1972    Rex     Kruse      12/15/1968
00002  Alexis   Lockett    10/31/2001  Mary      Lockett  9/20/1973   John    Lockett    12/23/1978
00003  Luke     Skywalker  7/4/2005    Padmé     Amidala  10/31/1975  Anakin  Skywalker  2/14/1980
00004  Leia     Skywalker  7/4/2005    Padmé     Amidala  10/31/1975  Anakin  Skywalker  2/14/1980
00005  Johanna  Trotter    2/14/2007   Susan     Jackson  8/18/1979   Zach    Trotter    11/17/1965
00006  Keith    Ward       11/23/2006  Teri      Sperry   3/14/1985   Kobe    Ward       3/15/1972
00007  Keith    Ward       11/23/2006  Teri      Sperry   3/14/1985   Kobe    Ward       3/15/1972
00008  Joan     Jett       8/14/2004   Caroline  Jett     1/25/1982   LeBron  Jett       3/2/1973
00008  Joan     Jett       8/14/2004   Caroline  Jett     1/25/1982   LeBron  Jett       3/2/1973

first_name  last_name    DOB         sex  race  weight  Birth Defect Code
Alexis      ROMANO       12/12/2002  0    3     2433    223
Andrea      HOLLIS       10/3/1999   0    1     2800    270
Andrea      HOLLIS       10/3/1999   0    1     1544    380
Kayla       CHRISTENSON              0    2     1754    193
Jordan      SOUZA        4/27/1996   0    4     2200    333
Dean        PADILLA      9/2/1999    1    3     2675    323
Johnathan   BLANCO       3/3/2011    1    3     1243    484
Kyla        CALDERON     8/29/2001   1    2     2589    303
Jaiden      FORTUNE      10/28/1993  0    3     1392    407


Shortest method in terms of lines of code

    PROC SORT data = library.dataset
              OUT = library.tablename
              [nodup|nodupkey];    /* SAS options */
      BY variable(s);
    run;

OUT = creates an output dataset

nodup removes exact duplicates (all vars same)

nodupkey removes duplicates based only on key variables (in the BY statement)

An example:

    libname MCH 'c:\AMCHP\TrainingData';

    PROC SORT data = MCH.Enrollment
              OUT = MCH.Enroll_NoDups
              nodupkey;
      BY Fname Lname DOB;    /* BY variables are separated by spaces, not commas */
    run;


Fname      Lname   DOB        RaceCode  nodup  nodupkey
Penelope   MAYO    2/23/2011  1         Keep   Keep
Penelope   MAYO    2/23/2011            Keep   Drop
Christian  MEYER   1/13/2010  2         Keep   Keep
Christian  MEYER   1/13/2010  1         Keep   Drop
Brandon    LOUIS   2/14/2011  3         Keep   Keep
Hazel      CARNEY  7/29/2007            Keep   Keep
Ava        ZIMMER  7/22/2007  2         Keep   Keep
Ava        ZIMMER  7/22/2007  2         Drop   Drop
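The nodupkey column of the table above can be reproduced with a short sketch. The work.* dataset names are hypothetical, and the DUPOUT= option (which writes the dropped rows to their own dataset) is an addition not shown on the slides; it is offered here only as one way to check counts.

```sas
/* Hypothetical copy of the eight example rows; "." marks a missing RaceCode */
data work.enrollment;
  input Fname :$12. Lname :$12. DOB :mmddyy10. RaceCode;
  format DOB mmddyy10.;
  datalines;
Penelope MAYO 2/23/2011 1
Penelope MAYO 2/23/2011 .
Christian MEYER 1/13/2010 2
Christian MEYER 1/13/2010 1
Brandon LOUIS 2/14/2011 3
Hazel CARNEY 7/29/2007 .
Ava ZIMMER 7/22/2007 2
Ava ZIMMER 7/22/2007 2
;
run;

/* Keep one row per Fname/Lname/DOB; send the dropped rows to a second dataset */
proc sort data=work.enrollment
          out=work.enroll_nodups
          dupout=work.enroll_dropped
          nodupkey;
  by Fname Lname DOB;
run;
```

With these rows, work.enroll_nodups should hold five records and work.enroll_dropped three, matching the Keep/Drop pattern in the nodupkey column.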

SAS chooses the first one it finds
◦ Based on the previous sort order
◦ Not usually an issue with preparation of linking files, since you are only using key fields

Sort data (regular proc sort with no options) BEFORE you de-duplicate to ensure that the correct row of data is selected
◦ e.g. “sort BY descending screen_date” so the most recent hearing loss status will be selected for each child
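A minimal sketch of the sort-then-de-duplicate advice above, assuming a hypothetical dataset work.screens with variables child_id, screen_date, and status (none of these names come from the slides). The EQUALS option is written out so that the second sort keeps rows in their incoming order within each BY group.

```sas
/* Hypothetical data: two screenings for child 1, one for child 2 */
data work.screens;
  input child_id screen_date :mmddyy10. status $;
  format screen_date mmddyy10.;
  datalines;
1 01/15/2011 refer
1 03/02/2011 pass
2 02/10/2011 pass
;
run;

/* Step 1: regular sort so the most recent screening is first for each child */
proc sort data=work.screens out=work.screens_sorted;
  by child_id descending screen_date;
run;

/* Step 2: keep the first row per child; EQUALS preserves order within BY groups */
proc sort data=work.screens_sorted out=work.latest_screen nodupkey equals;
  by child_id;
run;
```

Keeping the ordering sort and the de-duplicating sort as separate steps makes it explicit which row counts as the "first one".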

1. Run a PROC SORT without the NODUPKEY option
◦ The BY statement should have the key variables that uniquely identify a row (and an optional variable to tell it which record to choose)

2. Create a single KeyID if you have multiple key variables (use concatenation functions)

3. Run a Data Step that selects the first or last sorted KeyID in the group


var1 || var2 || …varx

CAT (var1, var2…varx)
◦ Concatenates as if the || was used

CATT (var1, var2…varx)
◦ Removes trailing blanks from each var

CATS (var1, var2…varx)
◦ Removes leading and trailing blanks from each var

CATX ('delimiter', var1, var2…varx)
◦ Same as CATS but adds a delimiter
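The differences among these functions can be seen in a small sketch. The variable names and values are made up for illustration; the comments describe what each function does to the blanks.

```sas
data work.cat_demo;
  length fname lname $10;
  fname = '  Ann';   /* two leading blanks; stored padded with trailing blanks */
  lname = 'Lee';
  k_cat  = cat(fname, lname);        /* keeps leading and trailing blanks */
  k_catt = catt(fname, lname);       /* removes trailing blanks only */
  k_cats = cats(fname, lname);       /* removes leading and trailing blanks: AnnLee */
  k_catx = catx('|', fname, lname);  /* as CATS, with a delimiter: Ann|Lee */
run;

proc print data=work.cat_demo;
run;
```

One design consideration when building a key: without a delimiter, different inputs can collapse to the same string (CATS('AB','C') and CATS('A','BC') both give ABC), so CATX may be the safer choice when the pieces vary in length.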

Create new concatenated key variable if needed

    data new;
      set old;
      newkeyID = CATS (lname, fname, dob);
    run;

Sort the data by the key variable and an optional sorting variable (e.g., encounter_date)
◦ Use descending before vars in BY statement to ensure nonmissing data or most recent dates will be selected as the “first one”

    proc sort data = library.dataset;
      BY newKeyID optionalvar;
    run;

Use the Data Step to find duplicates (or unique records, or both)

    data library.dataunique library.duplicates;
      set library.SourceDataset;
      BY newKeyID;
      /* test first.newKeyID or last.newKeyID (pick one), depending on which row to keep */
      if first.newKeyID = 1 then output library.dataunique;
      else output library.duplicates;
    run;


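    libname MCH 'c:\AMCHP\TrainingData';

    /* Step 1: build a single key from the key variables */
    data MCH.Enrollment2;
      set MCH.enrollment;
      keyID = CATS (lname, fname, dob);
    run;

    /* Step 2: sort by key, most recent enrollment first */
    proc sort data = MCH.enrollment2;
      by KeyID descending EnrollDate;
    run;

    /* Step 3: keep the first row per key; route the rest to a duplicates table */
    data MCH.keep MCH.dups;
      set MCH.enrollment2;
      BY KeyID;
      if first.KeyID=1 then output MCH.keep;
      else output MCH.dups;
    run;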

SQL = Structured Query Language

PROC SQL turns on SQL and it remains on until a QUIT statement is issued
◦ You do not need to submit a run; statement

There is no need to sort prior to merges

Do NOT need common variable names for key variables when merging

Semicolons are only needed at the end of the statement, not after each “clause”
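The points above about sorting and key names can be sketched with a small join. The tables, variable names, and values below are hypothetical; the key is called BirthCert in one table and CertNum in the other, and neither table is sorted first.

```sas
/* Hypothetical tables with differently named key variables */
data work.births;
  input BirthCert $ child_fname $;
  datalines;
00001 Mohamed
00002 Alexis
00003 Luke
;
run;

data work.defects;
  input CertNum $ defect_code $;
  datalines;
00003 code1
00001 code2
;
run;

/* One statement made of several clauses; a single semicolon ends it */
proc sql;
  create table work.linked as
  select b.BirthCert, b.child_fname, d.defect_code
  from work.births as b
       inner join work.defects as d
       on b.BirthCert = d.CertNum;
quit;
```

The ON clause names each table's key explicitly, which is why no renaming is needed.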

CREATE TABLE library.newtablename AS
◦ Sends output to a new table (like data newtable;)

SELECT keyvar1, keyvar2, …keyvarX FROM library.oldtablename
◦ Identifies source table (like set oldtable;) and variables to be output (like keep var1 var2;)
◦ For non-key variables, you must reduce each group to a single value (e.g., the earliest or latest) using the MIN or MAX keyword and rename the variable with AS

    SELECT fname, lname, dob, MAX (screen_date) AS RecentScreen

GROUP BY keyvar1, keyvar2, …keyvarX
◦ Sorts and groups by key variables


    proc sql;
      create table MCH.sqlunique as
      select fname, lname, dob,
             max(racecode) as race,
             max(screen_date) as RecentScreen
      from MCH.enrollment
      group by lname, fname, dob;
    quit;
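Because accurate counts matter for evaluating a linkage, the same GROUP BY approach can also report how many records share each key. This is a sketch on a hypothetical dataset work.enroll_demo; the HAVING clause is an addition not shown on the slides.

```sas
/* Hypothetical data: one key appears twice, one appears once */
data work.enroll_demo;
  input fname :$12. lname :$12. dob :mmddyy10.;
  format dob mmddyy10.;
  datalines;
Ava ZIMMER 7/22/2007
Ava ZIMMER 7/22/2007
Hazel CARNEY 7/29/2007
;
run;

/* List only the key combinations that occur more than once */
proc sql;
  create table work.dupkeys as
  select lname, fname, dob, count(*) as n_records
  from work.enroll_demo
  group by lname, fname, dob
  having count(*) > 1;
quit;
```

Here work.dupkeys should contain a single row for Ava ZIMMER with n_records equal to 2.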

Deterministic linkages
◦ Linkage steps
◦ Creating subsets of data (e.g., identifying remaining unlinked data after a linkage pass has taken place)

Adding in analysis variables after linkage is complete

[Venn diagram: datasets A and B, with overlap A+B]


Goal of merging is to make dataset WIDER

Identify which variable(s) the datasets have in common

PROC SORT each dataset by common variable(s)

If the merging variables do not have the same name, you must rename them in the data step

Most often you want to tag the dataset from which you want to keep all records

Think of this as the “master” dataset

Do this by using in=<tag>
◦ Assigns a tag, such as a letter or word, to records in the dataset you want as your “master”

Use the condition if <tag> to keep all records marked with the tag

Dataset #1 has the following records

ID#  Name  Code
A    Mary  1
B    Sue   2
C    Pam   3
D    Ann   5

Dataset #2 has definitions for the codes

Code  Definition
1     Medicaid
2     CHIP
3     Private Insurance
4     Tricare
5     No Insurance

    data together;
      merge dataset1 dataset2;
      by code;
    run;


ID   Name  Code  Definition
A    Mary  1     Medicaid
B    Sue   2     CHIP
C    Pam   3     Private Insurance
           4     Tricare
D    Ann   5     No Insurance

Result: a dataset with a blank record from dataset 1, which had no corresponding record for code 4

    data together;
      merge dataset1 (in=a) dataset2;
      BY code;    /* BY placed directly after the MERGE it applies to */
      if a;
    run;

The in=a tags each record in dataset1

The if a; will keep all records from dataset1 and only those that match from dataset2

ID#  Name  Code  <TAG>
A    Mary  1     a
B    Sue   2     a
C    Pam   3     a
D    Ann   5     a

ID#  Name  Code  Definition         <TAG>
A    Mary  1     Medicaid           a
B    Sue   2     CHIP               a
C    Pam   3     Private Insurance  a
           4     Tricare
D    Ann   5     No Insurance       a

SAS will exclude any record without an “a”

Your final “together” dataset will look like this:

ID#  Name  Code  Definition
A    Mary  1     Medicaid
B    Sue   2     CHIP
C    Pam   3     Private Insurance
D    Ann   5     No Insurance

NOTE: the in= tag can be any character or word you want, as long as it is a valid SAS variable name.


    proc sort data = dataset1; by code; run;
    proc sort data = dataset2; by code; run;

    data together;
      merge dataset1 (in=a) dataset2 (in=b);
      by code;
      if b;
    run;

    Proc sort data = dataset1; by BirthCert; run;
    Proc sort data = dataset2; by CertNum; run;

    data together;
      merge dataset1 (in=g rename=(BirthCert=CertNum)) dataset2;
      by CertNum;    /* merge on the renamed key, not on code */
      if g;
    run;


    data linked;
      merge moms(in=A) births(in=B);
      by momID;
      if A and B;
    run;

Only keeps records where key fields match in both tables

    data linked;
      merge moms(in=A) births(in=B);
      by momID;
      if A;
    run;

Keeps all records in dataset A and any matching data in dataset B

    data linked;
      merge moms(in=A) births(in=B);
      by momID;
      if B;
    run;

Keeps all records in dataset B and any matching data in dataset A


    data linked;
      merge moms(in=A) births(in=B);
      by momID;
    run;

Keeps all records in both datasets

    data linked;
      merge moms(in=A) births(in=B);
      by momID;
      if A and not B;
    run;

Keeps all records in A that did NOT match with B
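The in= conditions above can also be combined in one Data Step that writes the linked records and both sets of leftovers at the same time, which may be convenient between linkage passes. This is a sketch under assumed names: the work.* datasets, variables, and values are hypothetical.

```sas
/* Hypothetical inputs: M1 and M3 appear in both, M2 only in moms, M4 only in births */
data work.moms;
  input momID $ mom_fname $;
  datalines;
M1 Kayla
M2 Mary
M3 Susan
;
run;

data work.births;
  input momID $ child_fname $;
  datalines;
M1 Mohamed
M3 Johanna
M4 Keith
;
run;

proc sort data=work.moms;   by momID; run;
proc sort data=work.births; by momID; run;

/* Route each merged row by where its key was found */
data work.linked work.moms_only work.births_only;
  merge work.moms(in=A) work.births(in=B);
  by momID;
  if A and B then output work.linked;       /* key present in both */
  else if A then output work.moms_only;     /* in A, no match in B */
  else output work.births_only;             /* in B, no match in A */
run;
```

With these inputs, work.linked holds M1 and M3, work.moms_only holds M2, and work.births_only holds M4. The two leftover datasets can then feed the next linkage pass.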
