Data Cleaning Public
TRANSCRIPT
-
8/6/2019 Data Cleaning Public 1/11

Data cleaning: hints and tips
Felicity Clemens
Stata Users Group meeting
London, 17 & 18th May 2005
-
Introduction

Data cleaning: one of the most time-consuming jobs of all!
Many ways of attacking the same problem when using Stata
The talk will describe some common problems and propose possible solutions
These are mostly reminders!
-
Contents

1) Introduction to the first datasets
2) Identifying and removing duplicates by hand
3) Merging data and uses of the merge command
4) Generating a moving target variable
-
The study

A case-control study carried out across 3 central European countries
Exposure of interest: exposure to chemicals in the environment
Outcome of interest: cancer
-
Identifying duplicates in a dataset

This can be done automatically (using the duplicates set of commands)
We will demonstrate a manual method of identifying duplicates
Two different possibilities:
  - The same data have been entered on more than one occasion;
  - Different data have been entered using the same identifier (id numbers)
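The slides name the two approaches without showing commands. As a reminder, both might look like the following minimal Stata sketch; the identifier variable name `id` is an assumption, not from the talk:

```stata
* Automatic approach: the duplicates set of commands
duplicates report id         // tally how many observations share each id
duplicates list id           // list the duplicated ids for inspection
duplicates drop id, force    // keep one observation per id

* Manual approach: tag repeats within id by hand
sort id
by id: gen byte dup = cond(_N == 1, 0, _n)
list id dup if dup > 1       // inspect before dropping anything
drop if dup > 1
```

Note that `duplicates drop id, force` treats any two observations with the same id as duplicates, which matches the second possibility above (different data under one identifier) only if you have already checked which record to keep.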
-
The merge command

A necessary command in data management of most big studies
There are many different uses of the merge command. We look at two of them:
  - Simple merge on id
  - Multiple merge on id
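The slide lists the two uses without syntax. A hedged sketch, using hypothetical dataset names (questionnaires, labresults, followup) and the modern merge syntax; at the time of the talk the equivalent was `sort id` in both files followed by `merge id using filename`:

```stata
* Simple merge on id: one observation per id in each file
use questionnaires, clear        // hypothetical master dataset
merge 1:1 id using labresults    // hypothetical using dataset
tabulate _merge                  // 1 = master only, 2 = using only, 3 = matched

* Multiple merge on id: merge several files in turn, dropping
* _merge between steps so the next merge can create it afresh
drop _merge
merge 1:1 id using followup
```

Checking `_merge` after every step is the usual safeguard: unmatched observations often point to the very id problems the previous section described.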
-
Identifying a moving target

Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
Problem: we need to identify the year, counting backwards from 2002, in which the chemical changed from its 2002 level
Why? We need to overwrite the 2002 value with a new value, and overwrite backwards until the value changed
-
Identifying a moving target (2)

rescode  y1990  y1991  y1992
1010113     65     32     32
1010114     41     41     41
1010115     78     23     23
1010116     44     44     44
1010117     82     82     29
1010118     25     25     25
1010119     12     12      6
1010120     40     12      7
-
Identifying a moving target (3)

We will use the forval loop to examine the relationship between each year's observed value and the observed value for the previous year
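The slide describes the loop but does not show it. One possible sketch, not code from the talk: it assumes wide data with year variables y1982-y2002 and a hypothetical variable newval holding the replacement value:

```stata
* Backwards scan: find the first year, counting down from 2002,
* whose value still equals the next year's value
gen int last = 2002
forvalues y = 2001(-1)1982 {
    local next = `y' + 1
    * extend the unchanged run backwards one year at a time
    replace last = `y' if y`y' == y`next' & last == `next'
}

* Overwrite the whole unchanged run ending in 2002 with the new value
forvalues y = 1982/2002 {
    replace y`y' = newval if `y' >= last
}
```

For rescode 1010117 in the table above (82, 82, 29 for 1990-1992), the run back from the last year stops immediately at 1992, so only the 1992 value would be overwritten; for 1010114 (41, 41, 41) the run extends through all three years.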
-
Summary

Identifying duplicates: can be done by hand or automatically using the duplicates set of commands
Use of the merge command: to merge on a specific variable, to multiply merge datasets
Generating a moving target variable: the use of the forval loop