Data Cleaning Public
TRANSCRIPT
-
8/6/2019 Data Cleaning Public 1/11

Data cleaning: hints and tips
Felicity Clemens
Stata Users Group meeting
London, 17 & 18th May 2005
-
Introduction

Data cleaning: one of the most time-consuming jobs of all!
Many ways of attacking the same problem when using Stata
The talk will describe some common problems and propose possible solutions
These are mostly reminders!
-
Contents

1) Introduction to the first datasets
2) Identifying and removing duplicates by hand
3) Merging data and uses of the merge command
4) Generating a moving target variable
-
The study

A case-control study carried out across 3 central European countries
Exposure of interest: exposure to chemicals in the environment
Outcome of interest: cancer
-
Identifying duplicates in a dataset

This can be done automatically (using the duplicates set of commands)
We will demonstrate a manual method of identifying duplicates
Two different possibilities:
  - The same data have been entered on more than one occasion;
  - Different data have been entered using the same identifier (id numbers)
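The slides name the two approaches without showing commands. As a reminder, both might look like the following minimal Stata sketch; the identifier variable name `id` is an assumption, not from the talk:

```stata
* Automatic approach: the duplicates set of commands
duplicates report id         // tally how many observations share each id
duplicates list id           // list the duplicated ids for inspection
duplicates drop id, force    // keep one observation per id

* Manual approach: tag repeats within id by hand
sort id
by id: gen byte dup = cond(_N == 1, 0, _n)
list id dup if dup > 1       // inspect before dropping anything
drop if dup > 1
```

Note that `duplicates drop id, force` treats any two observations with the same id as duplicates, which matches the second possibility above (different data under one identifier) only if you have already checked which record to keep.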
-
The merge command

A necessary command in data management of most big studies
There are many different uses of the merge command. We look at two of them:
  - Simple merge on id
  - Multiple merge on id
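The slide lists the two uses without syntax. A hedged sketch, using hypothetical dataset names (questionnaires, labresults, followup) and the modern merge syntax; at the time of the talk the equivalent was `sort id` in both files followed by `merge id using filename`:

```stata
* Simple merge on id: one observation per id in each file
use questionnaires, clear        // hypothetical master dataset
merge 1:1 id using labresults    // hypothetical using dataset
tabulate _merge                  // 1 = master only, 2 = using only, 3 = matched

* Multiple merge on id: merge several files in turn, dropping
* _merge between steps so the next merge can create it afresh
drop _merge
merge 1:1 id using followup
```

Checking `_merge` after every step is the usual safeguard: unmatched observations often point to the very id problems the previous section described.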
-
Identifying a moving target

Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
Problem: we need to identify the year, counting backwards from 2002, in which the chemical changed from its 2002 level
Why? We need to overwrite the 2002 value with a new value, and overwrite backwards until the value changed
-
Identifying a moving target (2)

rescode  y1990  y1991  y1992
1010113     65     32     32
1010114     41     41     41
1010115     78     23     23
1010116     44     44     44
1010117     82     82     29
1010118     25     25     25
1010119     12     12      6
1010120     40     12      7
-
Identifying a moving target (3)

We will use the forval loop to examine the relationship between each year's observed value and the observed value for the previous year
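The slide describes the loop but does not show it. One possible sketch, not code from the talk: it assumes wide data with year variables y1982-y2002 and a hypothetical variable newval holding the replacement value:

```stata
* Backwards scan: find the first year, counting down from 2002,
* whose value still equals the next year's value
gen int last = 2002
forvalues y = 2001(-1)1982 {
    local next = `y' + 1
    * extend the unchanged run backwards one year at a time
    replace last = `y' if y`y' == y`next' & last == `next'
}

* Overwrite the whole unchanged run ending in 2002 with the new value
forvalues y = 1982/2002 {
    replace y`y' = newval if `y' >= last
}
```

For rescode 1010117 in the table above (82, 82, 29 for 1990-1992), the run back from the last year stops immediately at 1992, so only the 1992 value would be overwritten; for 1010114 (41, 41, 41) the run extends through all three years.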
-
Summary

Identifying duplicates: can be done by hand or automatically using the duplicates set of commands
Use of the merge command: to merge on a specific variable, to multiply merge datasets
Generating a moving target variable: the use of the forval loop