TRANSCRIPT
+
Data collection, checking and cleaning & Introduction to presenting statistical analyses
Zulma Rueda
Professor, Universidad Pontificia Bolivariana, Colombia
Adjunct professor, University of Manitoba
+The use of statistics in research
n Medical statistics is the tool by which numerical information can be translated into evidence
n This evidence might be for the cause of a disease or for the effectiveness of an intervention
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+The use of statistics in research
n The positive side: researchers can now be independent; they don’t need a friendly statistician
n The downside:
  n Because statistical programs are easy to use, it is equally easy to perform the wrong analysis
  n If the right analysis is performed, the programs often produce a large amount of output: some relevant and some irrelevant
n The programs, no matter how sophisticated, do not generally tell the user whether a particular analysis is valid or not
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+The use of statistics in research
n The aim: to share newly gained knowledge with others so that they can benefit from the findings for future research, professional practice, or both
n Sometimes the presentation of statistical information is not straightforward, and yet inadequate presentation will fail to communicate the relevant information, and may even communicate misleading or incorrect information
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Data collection
n Data collection is tied to the study protocol: what is actually collected and the format it is in
n It is important to know in advance what we are going to do with the data in order to ensure that it is collected in the right format
n It may be possible to answer a research question using existing data
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+
https://www.cdc.gov/immigrantrefugeehealth/pdf/tuberculosis-ti-2009.pdf
+
https://www.cdc.gov/immigrantrefugeehealth/pdf/tuberculosis-ti-2009.pdf
+Variables you collect in your daily work
n The medical history should focus on risk factors for tuberculosis disease including: previous history of tuberculosis; illness suggestive of tuberculosis (such as cough of >3 weeks’ duration, dyspnea, weight loss, fever, or hemoptysis); prior treatment suggestive of tuberculosis treatment; and prior diagnostic evaluation suggestive of tuberculosis. …
n … for children… fever, night sweats, growth delay, and weight loss.
n … inquiries regarding family or household contact with a person who has or had tuberculosis or illness, treatment, or diagnostic evaluation suggestive of tuberculosis.
n BCG vaccination
https://www.cdc.gov/immigrantrefugeehealth/pdf/tuberculosis-ti-2009.pdf
+Points concerning using existing datasets
n Can be cheaper and quicker than collecting new data
n The research question needs to be defined and researched in the same way as for a primary study
n A clear analysis plan is needed to avoid over-analysing the data
n Note that the dataset may not contain all the necessary information to answer the new question
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Points concerning using existing datasets
n Where data are analysed that have been collected for another purpose, this is sometimes referred to as secondary analysis
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+When collecting original data: Beware of collecting too much data
n Ask yourself:
  n Why are you collecting it?
  n Will it actually be analysed?
n Disadvantages of long questionnaires are:
  n They may discourage people from taking part, or from answering all the questions, and so lower the response rate
  n Questions may be answered less carefully
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Beware of collecting too much data
n Data processing time may be increased and results may be delayed
n They may lead to multiple hypothesis tests, which increase the chance of spurious significance
n Time and money may be wasted
n However, it is important to collect what you do need, as it may be difficult to get it later
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
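The point about multiple hypothesis tests can be made concrete with a small calculation. The sketch below assumes the tests are independent and each uses a 5% significance level (both are simplifying assumptions, not statements from the slides), and shows how quickly the chance of at least one spurious "significant" result grows.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false-positive result among n independent tests,
    when every null hypothesis is in fact true."""
    return 1 - (1 - alpha) ** n_tests


# The more variables are collected and tested, the higher the chance
# that something looks significant purely by chance.
for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {familywise_error_rate(k):.2f}")
```

With 20 independent tests, the chance of at least one spurious finding is already well over one half.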
+Example of forms that you fill:
https://www.reginfo.gov/public/do/PRAViewIC?ref_nbr=201105-1405-007&icID=38745
+Example of forms that you fill:
https://www.reginfo.gov/public/do/PRAViewIC?ref_nbr=201105-1405-007&icID=38745
+Transferring the data to computer: coding and data entry
n Before non-numeric data from a questionnaire or data collection form are entered into a computer, the responses need to be coded
n A unique number should be assigned to each possible response to facilitate statistical analysis
n For example, the question would be coded 1: immigrant, 2: special immigrant, 3: diversity, etc., according to the single answer given
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
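A minimal sketch of this kind of coding, assuming the responses arrive as free-text labels. The dictionary name `CATEGORY_CODES` and the function `code_response` are hypothetical; only the three labels and their codes come from the slide, and the categories hidden behind "etc." are deliberately not filled in.

```python
from typing import Dict, Optional

# Code book: one unique number per possible response (labels taken from the slide).
CATEGORY_CODES: Dict[str, int] = {
    "immigrant": 1,
    "special immigrant": 2,
    "diversity": 3,
}


def code_response(answer: str, code_book: Dict[str, int]) -> Optional[int]:
    """Return the numeric code for a single-answer question, or None if the
    answer is not in the code book (so it can be reviewed by hand)."""
    return code_book.get(answer.strip().lower())


print(code_response("Special immigrant", CATEGORY_CODES))
```

Returning `None` for unrecognised answers, rather than guessing a code, keeps unexpected responses visible for checking.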
+Transferring the data to computer: coding and data entry
n Please tick all that apply:
n Although this is one question, each person may tick a number of options. This needs to be entered as five separate variables, each coded as ‘no’ or ‘yes’, which could be entered as 0 or 1:
n Tuberculosis disease: 0 or 1
n Syphilis, untreated: 0 or 1 …
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
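The "tick all that apply" coding can be sketched as follows. Only two of the five options are named on the slide, so the list below contains just those two; the remaining options would be added to `OPTIONS` in the same way. The function name `code_ticked` is hypothetical.

```python
# Options named on the slide; the other three are elided in the source.
OPTIONS = ["Tuberculosis disease", "Syphilis, untreated"]


def code_ticked(ticked: set) -> dict:
    """Turn one multi-answer question into one 0/1 variable per option."""
    return {option: int(option in ticked) for option in OPTIONS}


# A participant who ticked only the first box:
print(code_ticked({"Tuberculosis disease"}))
```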
+Transferring the data to computer: coding and data entry
n Missing data are undesirable in any study
n It may be important to be able to distinguish between data which are missing because the subject failed to respond (i.e. missed the question out completely) and data where the answer was ‘don’t know’
n Create ‘don’t know’ as a valid answer and assign it a separate code from answers which are truly missing (a blank, or a code that is not valid for the other answers, e.g. 9)
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
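A short sketch of keeping ‘don’t know’ apart from truly missing answers. It assumes truly missing values are entered as a blank or as 9 (the slide's example code) and that ‘don’t know’ has its own code; the value 8 used here for ‘don’t know’ is purely hypothetical.

```python
from collections import Counter

MISSING_CODES = {"", "9"}  # blank or 9, following the slide's example
DONT_KNOW_CODE = "8"       # hypothetical code chosen for this sketch


def classify(raw: str) -> str:
    """Label one entered value as answered, don't know, or missing."""
    value = raw.strip()
    if value in MISSING_CODES:
        return "missing"
    if value == DONT_KNOW_CODE:
        return "don't know"
    return "answered"


entered = ["1", "", "8", "0", "9"]
counts = Counter(classify(v) for v in entered)
print(counts)
```

Counting the two categories separately makes it possible to report how many subjects skipped a question and how many genuinely did not know.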
+Portion of a spreadsheet showing data collected from participants
n Unique patient ID number
n Variable names up to 8 characters long, beginning with a letter
n Year of birth
n Have you had rhinitis with a cold in the last 12 months? 0= No, 1= Yes
n When did you have rhinitis? 1= Dry season, 2= Wet season, 3= Anytime
n Repeat measurements given similar but unique variable names
n Third measurements of pulse, systolic and diastolic blood pressure
n One line for each patient
n Blank fields indicate missing data
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Data checking and cleaning
n It is advisable to check the data as the study progresses rather than leave it until the end, as early checks may reveal problems which can be resolved
n Look for unlikely or impossible values or outliers, e.g. ‘DIAT2= 182’
n Possible errors like this can be identified if summary statistics and/or a histogram of the data are examined
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
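As an illustration, the sketch below computes simple summary statistics and a crude text histogram for a diastolic blood pressure variable. The readings are invented for this example; the value 182 simply echoes the ‘DIAT2= 182’ entry mentioned above.

```python
import statistics
from collections import Counter

# Hypothetical second diastolic readings (mmHg); 182 is the suspicious value.
diastolic = [78, 82, 75, 182, 80, 85, 79]

summary = {
    "n": len(diastolic),
    "min": min(diastolic),
    "max": max(diastolic),
    "mean": round(statistics.mean(diastolic), 1),
    "median": statistics.median(diastolic),
}
print(summary)

# Crude histogram: count readings in bins of width 20 mmHg.
bins = Counter((value // 20) * 20 for value in diastolic)
for lower in sorted(bins):
    print(f"{lower:>3}-{lower + 19:<3} {'*' * bins[lower]}")
```

The maximum sitting far from the median, and the isolated bar at the top of the histogram, both point to the value that needs checking against the original form.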
+Data checking and cleaning
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Data checking and cleaning
n In some data entry programs, the user can set acceptable limits for each variable and thus force the computer to flag or reject values outside that range
n Errors can also occur where the data have been incorrectly entered but the value entered is a possible value, and so is not flagged
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
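Range checks of this kind are easy to sketch. The limits below are hypothetical values chosen for illustration, not limits recommended by the source, and the variable names are invented.

```python
# Hypothetical acceptable limits (lower, upper) for each variable.
LIMITS = {
    "pulse": (30, 200),
    "systolic": (70, 250),
    "diastolic": (40, 150),
}


def out_of_range(record: dict, limits: dict = LIMITS) -> list:
    """Return the names of variables whose value falls outside its limits.
    Blank (None) values are skipped: they are missing, not out of range."""
    flagged = []
    for name, (lower, upper) in limits.items():
        value = record.get(name)
        if value is not None and not lower <= value <= upper:
            flagged.append(name)
    return flagged


print(out_of_range({"pulse": 72, "systolic": 120, "diastolic": 182}))
```

A check like this catches impossible or unlikely values only; a wrongly entered value that is still plausible passes through unflagged, as the slide notes.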
+Checking for errors in the data
n Data inconsistent: patient does not have rhinitis but answered question about when rhinitis occurred
n Value outside likely range: diastolic blood pressure high although possible. However, measurement also inconsistent with first and third readings; possible transcription error?
n Values outside likely range: diastolic blood pressure and systolic blood pressure too high and pulse too low. Measurements also inconsistent with first and third readings; likely that machine was not working properly for this set of readings
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
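The two kinds of check described on this slide, logical consistency between questions and agreement between repeated readings, might be sketched as below. The field names (`rhinitis`, `rhin_when`, `diastolic`) and the tolerance of 30 mmHg are assumptions made for this example only.

```python
def check_record(record: dict, tolerance: float = 30) -> list:
    """Return a list of problems found in one patient's record."""
    problems = []

    # Consistency between questions: 'when' should be blank if rhinitis is 'no' (0).
    if record["rhinitis"] == 0 and record["rhin_when"] is not None:
        problems.append("answered 'when' although rhinitis is 'no'")

    # Consistency between repeated readings: compare the second reading
    # with the average of the first and third.
    first, second, third = record["diastolic"]
    if abs(second - (first + third) / 2) > tolerance:
        problems.append("second diastolic reading inconsistent with first and third")

    return problems


suspect = {"rhinitis": 0, "rhin_when": 2, "diastolic": (80, 182, 82)}
print(check_record(suspect))
```

Flagged records still need a human decision: the check cannot tell a transcription error from a faulty machine.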
+
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Sample size for Chlamydia prevalence study
n Aim:
  n To calculate the prevalence of Chlamydia infections among women attending the GP for cervical smears
n Information required:
  n Estimate of the prevalence= 7% (from previous studies)
  n Confidence level= 95% (decided by the researcher)
  n Accuracy of +/- 1.4 percentage points (decided by the researcher)
n Required sample size:
  n 1300 women (from Epi-info)
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
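The slide takes its figure from Epi-info and does not state the formula. As a rough cross-check, the sketch below uses the standard normal-approximation formula for estimating a single proportion, n = z² p(1 − p) / d², which is an assumption about the method rather than a description of what Epi-info does.

```python
from math import ceil
from statistics import NormalDist


def n_for_proportion(p: float, half_width: float, confidence: float = 0.95) -> float:
    """Sample size to estimate a proportion p to within +/- half_width
    (normal approximation, no finite-population correction)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * p * (1 - p) / half_width ** 2


n = n_for_proportion(p=0.07, half_width=0.014)
print(ceil(n))
```

The result, a little under 1300, is consistent with the rounded figure quoted on the slide.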
+Sample size for sensitivity
n Aim:
  n To calculate the sensitivity and specificity of nuchal translucency screening for chromosomal abnormalities using an unselected cohort of pregnant women
n Information required:
  n Estimate of the prevalence of chromosomal abnormalities= 1% (from previous studies)
  n Estimate of sensitivity= 70% (from previous studies)
  n Confidence level= 95% (decided by the researcher)
  n Accuracy of +/- 20 percentage points (decided by the researcher)
n Required sample size:
  n 20 babies with chromosomal abnormalities (1% of population) and hence 2000 pregnant women required overall
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
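The two-step reasoning here (first the number of affected babies needed to estimate sensitivity, then the cohort size needed to find that many) can be sketched as follows. The normal-approximation formula is an assumption; the slide does not say how the figure of 20 was derived.

```python
from statistics import NormalDist


def cases_needed(sensitivity: float, half_width: float, confidence: float = 0.95) -> float:
    """Number of true cases needed to estimate sensitivity to within +/- half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * sensitivity * (1 - sensitivity) / half_width ** 2


PREVALENCE = 0.01  # 1% of babies have a chromosomal abnormality (from the slide)

cases = cases_needed(sensitivity=0.70, half_width=0.20)
total_women = round(cases) / PREVALENCE  # cohort needed to yield that many cases

print(round(cases), round(total_women))
```

Sensitivity is estimated only among affected babies, so the rare outcome, not the precision requirement itself, is what drives the overall cohort size.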
+Sample size for comparing two proportions
n Aim: To compare prevalence of death or chronic lung disease (CLD) in premature babies randomized to methods of ventilation
n Outcome: Death or CLD at 36 weeks post menstrual age
n Information required:
  n Estimate of the prevalence in control group= 67% (from previous studies)
  n Significance level= 5% (decided by the researcher)
  n Risk difference to be detected= 11% (i.e. 56% in intervention group)
  n Power= 90% (decided by the researcher)
  n Babies will be randomized to two equal-sized groups
n Required sample size: 428 babies in each group
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
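The slide does not say which formula produced 428. One widely used option is the normal-approximation formula for two proportions with a continuity correction (often attributed to Fleiss); the sketch below uses it as an assumption, and with the slide's inputs it happens to give the same figure.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_two_proportions(p1: float, p2: float, alpha: float = 0.05,
                      power: float = 0.90, continuity: bool = True) -> float:
    """Sample size per group for a two-sided comparison of two proportions
    with equal group sizes (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2
    if continuity:
        # Continuity correction inflates n slightly.
        n = n / 4 * (1 + sqrt(1 + 4 / (n * diff))) ** 2
    return n


print(ceil(n_two_proportions(0.67, 0.56)))                    # with correction
print(ceil(n_two_proportions(0.67, 0.56, continuity=False)))  # without correction
```

Different software packages use slightly different formulas, so small discrepancies between programs are expected.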
+Sample size for comparing two means
n Aim:
  n To compare mean birthweight of babies in different social class subgroups
n Information required:
  n Estimate of the standard deviation of the birthweight= 500g (from previous studies)
  n Significance level= 5% (decided by the researcher)
  n Difference to be detected= 180g (from previous studies)
  n Power= 90% (decided by the researcher)
n Required sample size:
  n 163 women in each group
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
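For two means, the usual normal-approximation formula is n = 2(z₁₋α/₂ + z₁₋β)² σ² / δ² per group. The sketch below applies it to the slide's inputs, assuming a two-sided 5% significance level; the formula is a standard one but is not named in the source.

```python
from math import ceil
from statistics import NormalDist


def n_two_means(sd: float, difference: float, alpha: float = 0.05,
                power: float = 0.90) -> float:
    """Sample size per group to detect a given difference between two means
    (equal group sizes, common standard deviation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / difference ** 2


print(ceil(n_two_means(sd=500, difference=180)))
```

Because n depends on (σ/δ)², halving the difference to be detected roughly quadruples the required sample size.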
+Presenting numerical data and common statistics
n Proportions
  n Give to two significant figures (e.g. 0.25, 0.0056)
  n Give numbers as well as the actual proportion unless obvious
  n Use percentage or rate per 1000, 10000, etc. if proportion is very small
n Percentages
  n Give percentages less than 10 or greater than 90 to one decimal place (e.g. 5.2%, 93.8%)
  n Consider giving percentages between 10% and 90% as whole numbers, unless the extra precision is needed (e.g. 27% vs 37%)
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Presenting numerical data and common statistics
n Percentages
  n Give numbers as well as actual percentage unless obvious, but make clear which is which
  n Do not use percentages if sample is less than 10
n Mean, SD, SE
  n Present to one more significant figure than original data
  n Do not use +/- as this is potentially ambiguous. Use ‘mean (SD)=…’ or ‘mean (SE)= …’
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
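The percentage rules from these slides can be collected into one small helper. This is only a sketch: the function name `format_percent` and the "count/total (percentage)" layout are choices made for this example.

```python
def format_percent(count: int, total: int) -> str:
    """Format a frequency following the slide's rules:
    - fewer than 10 subjects: give the numbers only, no percentage
    - below 10% or above 90%: one decimal place
    - otherwise: whole number
    """
    if total < 10:
        return f"{count}/{total}"
    percent = 100 * count / total
    if percent < 10 or percent > 90:
        text = f"{percent:.1f}%"
    else:
        text = f"{percent:.0f}%"
    return f"{count}/{total} ({text})"


print(format_percent(13, 250))
print(format_percent(27, 100))
print(format_percent(3, 8))
```

Showing the count and denominator next to the percentage keeps it clear which is which.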
+Presenting numerical data and common statistics
n CIs
  n Present to one more significant figure than original data
  n Present as ‘2 to 4’ or ‘2, 4’, not ‘2-4’, since this is ambiguous if negative values are possible
n P values
  n Present actual P values wherever possible, whether significant or not
  n Give no more than two significant figures, e.g. present 0.0392 as 0.039 and 0.596 as 0.60
  n If the package gives P= 0.0000, present as <0.0001
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
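The P value rules translate directly into a small formatting helper. The sketch below is one possible way to do it in Python; the function name `format_p` is hypothetical. The `#` flag in the format specification keeps the trailing zero, so 0.596 is shown as 0.60 rather than 0.6.

```python
def format_p(p: float) -> str:
    """Format a P value to two significant figures, or '<0.0001' if smaller."""
    if p < 0.0001:
        return "<0.0001"
    return f"{p:#.2g}"


for p in (0.0392, 0.596, 0.00001):
    print(format_p(p))
```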
+Beginning the results section
n Before performing the main analyses it is important to describe how the sample was obtained and the main relevant characteristics:
  n Sampling frame
  n Number of subjects originally selected
  n Number of subjects subsequently excluded because of ineligibility
  n Number of non-responders or no data
  n Comparison of responders and non-responders if possible
  n Number of subjects withdrawing before completing the study
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+
Rueda ZV, López L, Vélez LA, Marín D, Giraldo MR, et al. (2013) High Incidence of Tuberculosis, Low Sensitivity of Current Diagnostic Scheme and Prolonged Culture Positivity in Four Colombian Prisons. A Cohort Study. PLoS ONE 8(11): e80592.
+
Nuzzo JB, Golub JE, Chaulk P, Shah M. Postarrival Tuberculosis Screening of High-Risk Immigrants at a Local Health Department. Am J Public Health. 2015 Jul;105(7):1432-8.
+Guidelines for tables
n Title should explain what the table is about and what subjects or observations are included
n Give number of subjects or observations in each group
n Label rows and columns clearly
n Give confidence intervals for comparisons, not just P values
n Give SD, SE, or CI for means
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Guidelines for tables
n Give percentages alongside frequencies unless group size is less than 10
n Give range or IQR for medians
n State units used
n Use consistent and appropriate decimal places
n Refer to table in the text
n Keep table simple for slides or poster and check text size for legibility
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+
Wilson FA, Miller TL, Stimpson JP. Mycobacterium Tuberculosis Infection, Immigration Status, and Diagnostic Discordance: A Comparison of Tuberculin Skin Test and QuantiFERON-TB Gold In-Tube Test Among Immigrants to the U.S. Public Health Rep. 2016 Mar-Apr;131(2):303-10
+Guidelines for graphs
n Title should explain what the graph is about and what subjects or observations are included
n Give number of subjects or observations
n Label axes, giving units as appropriate
n Refer to graph in the text
n Does the graph show enough information to justify the space it takes?
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+Guidelines for graphs
n For clarity, use two-dimensional rather than three-dimensional graphs
n For a paper: is the graph necessary? Could the data be presented in another way?
n In a slide or poster: will the text be legible?
Peacock J, Kerry S. Presenting medical statistics from proposal to publication. Oxford University Press, 2007
+
Liu Y, Posey DL, Cetron MS, Painter JA. Effect of a culture-based screening algorithm on tuberculosis incidence in immigrants and refugees bound for the United States: a population-based cross-sectional study. Ann Intern Med. 2015 Mar 17;162(6):420-8.
+
Liu Y, Posey DL, Cetron MS, Painter JA. Effect of a culture-based screening algorithm on tuberculosis incidence in immigrants and refugees bound for the United States: a population-based cross-sectional study. Ann Intern Med. 2015 Mar 17;162(6):420-8.
+
Liu Y, Posey DL, Cetron MS, Painter JA. Effect of a culture-based screening algorithm on tuberculosis incidence in immigrants and refugees bound for the United States: a population-based cross-sectional study. Ann Intern Med. 2015 Mar 17;162(6):420-8.
+
Liu Y, Posey DL, Cetron MS, Painter JA. Effect of a culture-based screening algorithm on tuberculosis incidence in immigrants and refugees bound for the United States: a population-based cross-sectional study. Ann Intern Med. 2015 Mar 17;162(6):420-8.
+Comparing two or more sets of data
https://www.pinterest.com/pin/184084703492988528/
+
Rueda ZV, López L, Vélez LA, Marín D, Giraldo MR, et al. (2013) High Incidence of Tuberculosis, Low Sensitivity of Current Diagnostic Scheme and Prolonged Culture Positivity in Four Colombian Prisons. A Cohort Study. PLoS ONE 8(11): e80592.
+
Rueda ZV, López L, Vélez LA, Marín D, Giraldo MR, et al. (2013) High Incidence of Tuberculosis, Low Sensitivity of Current Diagnostic Scheme and Prolonged Culture Positivity in Four Colombian Prisons. A Cohort Study. PLoS ONE 8(11): e80592.
+From association to causation
Gordis L. Epidemiology. 5th ed. Elsevier; 2014
+Correlation does not imply causation
Gordis L. Epidemiology. 5th ed. Elsevier; 2014
+Guidelines for judging whether an observed association is causal
n Temporal relationship
n Strength of the association
n Dose-response relationship
n Replication of the findings
n Biologic plausibility
n Consideration of alternate explanations
n Cessation of exposure
n Consistency with other knowledge
n Specificity of the association
Gordis L. Epidemiology. 5th ed. Elsevier; 2014
+We cannot forget some key aspects of our study design
+
n The extent to which a test result reflects the true value (that is, the extent to which it is valid) depends on minimizing two major classes of error: systematic error (bias) and random error
Image taken from: http://nothingnerdy.wikispaces.com/1+Physics+and+physical+measurement
+Two types of error
n Systematic error:
  n Poor accuracy
  n Reproducible
  n Due to selection &/or information bias, or confounding
n Random error:
  n Poor precision
  n Not reproducible
Image taken from: http://nothingnerdy.wikispaces.com/1+Physics+and+physical+measurement
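The difference between the two types of error can be illustrated with a small simulation. Everything in the sketch below is invented for illustration: the true value, the size of the bias, and the amount of noise. A fixed random seed is used so the run is repeatable.

```python
import random
from statistics import mean, stdev

TRUE_VALUE = 80.0           # hypothetical true diastolic pressure (mmHg)
rng = random.Random(0)      # fixed seed so the simulation is repeatable


def measure(bias: float, noise_sd: float, n: int = 1000) -> list:
    """Simulate n measurements with a constant bias and Gaussian random noise."""
    return [TRUE_VALUE + bias + rng.gauss(0, noise_sd) for _ in range(n)]


# Systematic error: readings agree with each other but sit away from the truth.
systematic = measure(bias=5.0, noise_sd=0.5)
# Random error: readings scatter widely but centre on the truth.
random_only = measure(bias=0.0, noise_sd=5.0)

for label, data in (("systematic", systematic), ("random", random_only)):
    print(f"{label:>10}: mean={mean(data):.1f}  SD={stdev(data):.1f}")
```

Taking more measurements shrinks the effect of random error on the mean, but no amount of repetition removes a systematic error.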
+Determining a study protocol
n Identify all data handling and processing steps, from specimen collection to recording data in a database
n Assess the potential for error at each step, and the error tolerance
n Determine the reliability of the selected measure across a range of values
+Internal and external validity
Gordis L. Epidemiology. 5th ed. Elsevier; 2014
+Checklist for writing up a research study
n Abstract
  n Stand-alone document
  n Report main outcome with estimates and 95% CI if possible
  n Draw valid conclusions
n Introduction
  n What is the research question?
  n What do we know already?
  n What are the gaps?
  n What does this study add?
+Checklist for writing up a research study
n Methods
  n Describe study design and conduct
  n Choice of subjects
  n Sample size
  n Data collected
  n Statistical analysis
n Results
  n Describe characteristics of the sample
  n Describe findings
  n Don’t just give P values: present estimates and CIs
+Checklist for writing up a research study
n Discussion
  n Summarise findings
  n Describe how they fit with existing knowledge
  n Discuss any limitations
  n Draw conclusions and make suggestions for future research
n Check these webpages for reporting guidelines:
  n http://www.equator-network.org
  n http://collections.plos.org/reporting-guidelines
+
Thanks!!!