statistic i: data collection & handling
DESCRIPTION
Statistic I: Data Collection & HandlingTRANSCRIPT
©[email protected] 2013
Collect, Explore & Summarise
Dr Azmi Mohd TamilDept of Community Health
Universiti Kebangsaan Malaysia
FK6163
©[email protected] 2013
Data Collection
Data collection begins after deciding on design of study and the sampling strategy
©[email protected] 2013
Data Collection
Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.
©[email protected] 2013
Data Collection
Information is collected on certain characteristics, attributes and the qualities of interest from the samples
These data may be quantitative or qualitative in nature.
©[email protected] 2013
Data Collection Techniques
Use available information Observation Interviews Questionnaires Focus group discussion
©[email protected] 2013
Using Available Information
Existing Records• Hospital records - case notes• National registry of births & deaths• Census data• Data from other surveys
©[email protected] 2013
Disadvantages of using existing records
Incomplete records Cause of death may not be verified by a
physician/MD Missing vital information Difficult to decipher May not be representative of the target
group - only severe cases go to hospital
©[email protected] 2013
©[email protected] 2013
Disadvantages of using existing records
Delayed publication - obsolete data Different method of data recording
between institutions, states, countries, making comparison & pooling of data incompatible
Comparisons across time difficult due to difference in classification, diagnostic tools etc
©[email protected] 2013
Advantages of using existing records
Cheap convenient in some situations, it is the only data
source i.e. accidents & suicides
©[email protected] 2013
Observation
Involves systematically selecting, watching & recording behaviour and characteristics of living beings, objects or phenomena
Done using defined scales Participant observation e.g. PEF and
asthma symptom diary Non-participant observation e.g.
cholesterol levels
©[email protected] 2013
Interviews
Oral questioning of respondents either individually or as a group.
Can be done loosely or highly structured using a questionnaire
©[email protected] 2013
Administering Written Questionnaires
Self-administered via mail by gathering them in one place and
getting them to fill it up hand-delivering and collecting them later Large non-response can distort results
©[email protected] 2013
Questionnaires
Influenced by education & attitude of respondent esp. for self-administered
Interviewers need to be trained open ended vs close ended the need for pre-testing or pilot study
©[email protected] 2013
Focus group discussion
Selecting relevant parties to the research questions at hand and discussing with them in focus groups
examples in your own field of interest?
©[email protected] 2013
Plan for data collection
Permission to proceed Logistics - who will collect what, when
and with what resources Quality control
©[email protected] 2013
Accuracy & Reliability
Accuracy - the degree which a measurement actually measures the measures the characteristic it is supposed to measure
Reliability is the consistency of replicate measures
©[email protected] 2013
Reliability & Accuracy
©[email protected] 2013
Accuracy & Reliability
Both are reduced by random error and systematic error from the same sources of variability;• the data collectors• the respondents• the instrument
©[email protected] 2013
Strategies to enhance precision & accuracy
Standardise procedures and measurement methods
training & certifying the data collectors Repetition Blinding
©[email protected] 2013
Introduction
Method of Exploring and Summarising Data differs
According to Types of Variables
©[email protected] 2013
Dependent/Independent
Frequency of Exercise
Obesity
Food Intake
Independent Variables
Dependent Variable
©[email protected] 2013
©[email protected] 2013
Explore
It is the first step in the analytic process to explore the characteristics of the data to screen for errors and correct them to look for distribution patterns - normal
distribution or not May require transformation before further
analysis using parametric methods Or may need analysis using non-parametric
techniques
©[email protected] 2013
Data Screening
By running frequencies, we may detect inappropriate responses
How many in the audience have 15 children and currently pregnant with the 16th?
PARITY
67 30.7
44 20.2
36 16.5
22 10.1
21 9.6
8 3.7
3 1.4
7 3.2
5 2.3
3 1.4
1 .5
1 .5
218 100.0
1
2
3
4
5
6
7
8
9
10
11
15
Total
ValidFrequency Percent
©[email protected] 2013
Data Screening
See whether the data make sense or not.
E.g. Parity 10 but age only 25.
©[email protected] 2013
©[email protected] 2013
©[email protected] 2013
Data Screening
By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data
Descriptive Statistics
184 32 484 53.05 33.37
184
Pre-pregnancy weight
Valid N (listwise)
N Minimum Maximum MeanStd.
Deviation
©[email protected] 2013
Interpreting the Box Plot
Outlier
Outlier
Upper quartile
Smallest non-outlier
Median
Lower quartile
Largest non-outlier The whiskers extend to 1.5 times the box width from both ends of the box and ends at an observed value. Three times the box width marks the boundary between "mild" and "extreme" outliers.
"mild" = closed dots "extreme"= open dots
©[email protected] 2013
Data Screening
We can also make use of graphical tools such as the box plot to detect wrong data entry 184N =
Pre-pregnancy weight
600
500
400
300
200
100
0
141198211181
73
©[email protected] 2013
Data Cleaning
Identify the extreme/wrong values Check with original data source – i.e.
questionnaire If incorrect, do the necessary correction. Correction must be done before
transformation, recoding and analysis.
©[email protected] 2013
Parameters of Data Distribution
Mean – central value of data Standard deviation – measure of how
the data scatter around the mean Symmetry (skewness) – the degree of
the data pile up on one side of the mean Kurtosis – how far data scatter from the
mean
©[email protected] 2013
Normal distribution
The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.
The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.
However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.
©[email protected] 2013
Normal distribution
If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it includes about 68.3% of the observations;
a range of two standard deviations above and two below (+ 2sd) about 95.4% of the observations; and
of three standard deviations above and three below (+ 3sd) about 99.7% of the observations
68.3%
95.4%
99.7%
©[email protected] 2013
Normality
Why bother with normality?? Because it dictates the type of analysis
that you can run on the data
©[email protected] 2013
Normality-Why?Parametric
©[email protected] 2013
Normality-Why?Non-parametric
©[email protected] 2013
Normality-How?
Explored graphically• Histogram• Stem & Leaf• Box plot• Normal probability
plot• Detrended normal
plot
Explored statistically• Kolmogorov-Smirnov
statistic, with Lilliefors significance level and the Shapiro-Wilks statistic
• Skew ness (0)• Kurtosis (0)
– + leptokurtic– 0 mesokurtik– - platykurtic
©[email protected] 2013
Kolmogorov- Smirnov
In the 1930’s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came out with the approach for comparison of distributions that did not make use of parameters.
This is known as the Kolmogorov-Smirnov test.
©[email protected] 2013
Skew ness
Skewed to the right indicates the presence of large extreme values
Skewed to the left indicates the presence of small extreme values
©[email protected] 2013
Kurtosis
For symmetrical distribution only.
Describes the shape of the curve
Mesokurtic - average shaped
Leptokurtic - narrow & slim
Platikurtic - flat & wide
©[email protected] 2013
Skew ness & Kurtosis
Skew ness ranges from -3 to 3. Acceptable range for normality is skew ness
lying between -1 to 1. Normality should not be based on skew ness
alone; the kurtosis measures the “peak ness” of the bell-curve (see Fig. 4).
Likewise, acceptable range for normality is kurtosis lying between -1 to 1.
©[email protected] 2013
©[email protected] 2013
Normality - ExamplesGraphically
Height
167.5
165.0
162.5
160.0
157.5
155.0
152.5
150.0
147.5
145.0
142.5
140.0
60
50
40
30
20
10
0
Std. Dev = 5.26
Mean = 151.6
N = 218.00
©[email protected] 2013
Q&Q Plot
This plot compares the quintiles of a data distribution with the quintiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).
If the distributional shapes differ, then the points will plot along a curve instead of a line.
Take note that the interest here is the central portion of the line, severe deviations means non-normality. Deviations at the “ends” of the curve signifies the existence of outliers.
©[email protected] 2013
Normality - ExamplesGraphically
Detrended Normal Q-Q Plot of Height
Observed Value
170160150140130
De
v fr
om
No
rma
l
.6
.5
.4
.3
.2
.1
0.0
-.1
-.2
Normal Q-Q Plot of Height
Observed Value
170160150140130
Exp
ect
ed
No
rma
l
3
2
1
0
-1
-2
-3
©[email protected] 2013
Normal distributionMean=median=mode
©[email protected] 2013
Tests of Normality
.060 218 .052HeightStatistic df Sig.
Kolmogorov-Smirnova
Lilliefors Significance Correctiona.
Descriptives
151.65 .356
150.94
152.35
151.59
151.50
27.649
5.258
139
168
29
8.00
.148 .165
.061 .328
Mean
Lower Bound
Upper Bound
95% ConfidenceInterval for Mean
5% Trimmed Mean
Median
Variance
Std. Deviation
Minimum
Maximum
Range
Interquartile Range
Skewness
Kurtosis
HeightStatistic Std. Error
Normality - ExamplesStatistically
Shapiro-Wilks; only if sample size less than 100.
Normal distributionMean=median=mode
Skewness & kurtosis within +1
p > 0.05, so normal distribution
©[email protected] 2013
K-S Test
©[email protected] 2013
K-S Test
very sensitive to the sample sizes of the data.
For small samples (n<20, say), the likelihood of getting p<0.05 is low
for large samples (n>100), a slight deviation from normality will result in being reported as abnormal distribution
©[email protected] 2013
Guide to deciding on normality
©[email protected] 2013
Normality Transformation
Normal Q-Q Plot of LN_PARIT
Observed Value
3.02.52.01.51.0.50.0-.5
Exp
ect
ed
No
rma
l
3
2
1
0
-1
-2
Normal Q-Q Plot of LN_PARIT
Observed Value
3.02.52.01.51.0.50.0-.5
Exp
ect
ed
No
rma
l
3
2
1
0
-1
-2
Normal Q-Q Plot of PARITY
Observed Value
1614121086420
Exp
ect
ed
No
rma
l
3
2
1
0
-1
-2
Normal Q-Q Plot of PARITY
Observed Value
1614121086420
Exp
ect
ed
No
rma
l
3
2
1
0
-1
-2
©[email protected] 2013
Square root Logarithm Inverse
Reflect and square root
Reflect and logarithm
Reflect and inverse
TYPES OF TRANSFORMATIONS
©[email protected] 2013
Summarise
Summarise a large set of data by a few meaningful numbers.
Single variable analysis• For the purpose of describing the data• Example; in one year, what kind of cases are
treated by the Psychiatric Dept?• Tables & diagrams are usually used to describe
the data• For numerical data, measures of central tendency
& spread is usually used
©[email protected] 2013
Frequency Table
Race F %Malay 760 95.84%
Chinese 5 0.63%Indian 0 0.00%Others 28 3.53%TOTAL 793 100.00%
•Illustrates the frequency observed for each category
©[email protected] 2013
Frequency Distribution Table
Umur Bil %0-0.99 25 3.26%1-4.99 78 10.18%5-14.99 140 18.28%15-24.99 126 16.45%25-34.99 112 14.62%35-44.99 90 11.75%45-54.99 66 8.62%55-64.99 60 7.83%65-74.99 50 6.53%75-84.99 16 2.09%85+ 3 0.39%JUMLAH 766
• > 20 observations, best presented as a frequency distribution table.
•Columns divided into class & frequency.
•Mod class can be determined using such tables.
©[email protected] 2013
Measurement of Central Tendency & Spread
©[email protected] 2013
Measures of Variability
Standard deviation Inter-quartilesSkew ness & kurtosis
©[email protected] 2013
Mean
the average of the data collected To calculate the mean, add up the
observed values and divide by the number of them.
A major disadvantage of the mean is that it is sensitive to outlying points
©[email protected] 2013
Mean: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Total of x = 648 n= 20 Mean = 648/20 = 32.4
©[email protected] 2013
Measures of variation - standard deviation
tells us how much all the scores in a dataset cluster around the mean. A large S.D. is indicative of a more varied data scores.
a summary measure of the differences of each observation from the mean.
If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.
Consequently the squares of the differences are added.
©[email protected] 2013
©[email protected] 2013
sd: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Mean = 32.4; n = 20 Total of (x-mean)2
= 3050.8 Variance = 3050.8/19
= 160.5684 sd = 160.56840.5=12.67
x (x-mean)^2 x (x-mean)^2
12 416.16 32 0.16
13 376.36 35 6.76
17 237.16 37 21.16
21 129.96 38 31.36
24 70.56 41 73.96
24 70.56 43 112.36
26 40.96 44 134.56
27 29.16 46 184.96
27 29.16 53 424.36
30 5.76 58 655.36
TOTAL 1405.8 TOTAL 1645
©[email protected] 2013
Median
the ranked value that lies in the middle of the data
the point which has the property that half the data are greater than it, and half the data are less than it.
if n is even, average the n/2th largest and the n/2 + 1th largest observations
"robust" to outliers
©[email protected] 2013
Median:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
(20+1)/2 = 10th which is 30, 11th is 32 Therefore median is (30 + 32)/2 = 31
©[email protected] 2013
Measures of variation - quartiles
The range is very susceptible to what are known as outliers
A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.
©[email protected] 2013
Quartiles
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
25th percentile 24; (24+24)/2 50th percentile 31; (30+32)/2 ; = median 75th percentile 42.5; (41+43)/2
©[email protected] 2013
Mode
The most frequent occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.
It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.
©[email protected] 2013
Mode: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Modes are 24 (10%) & 27 (10%)
©[email protected] 2013
Mean or Median?
Which measure of central tendency should we use?
if the distribution is normal, the mean+sd will be the measure to be presented, otherwise the median+IQR should be more appropriate.
©[email protected] 2013
Graphing Categorical Data: Univariate Data
Categorical Data
Tabulating Data
The Summary Table
0 1 0 2 0 3 0 4 0 5 0
S to c k s
B o n d s
S a vin g s
C D
Graphing Data
Pie Charts
Pareto DiagramBar Charts
0
5
1 0
1 5
2 0
2 5
3 0
3 5
4 0
4 5
S to c k s B o n d s S a vin g s C D
0
2 0
4 0
6 0
8 0
1 0 0
1 2 0
©[email protected] 2013
Bar Chart
Type of work
Field w orkOffice w orkHousew ife
Pe
rce
nt
80
60
40
20
0
20
11
69
©[email protected] 2013
Tabulating and Graphing Bivariate
Categorical Data Contingency tables:
Table 1: Contigency table of pregnancy induced hypertension andSGA
Count
103 94 197
5 16 21
108 110 218
No
Yes
Pregnancy inducedhypertension
Total
Normal SGA
SGA
Total
©[email protected] 2013
Tabulating and Graphing Bivariate
Categorical Data
Pregnancy induced hypertension
YesNo
Co
un
t
120
100
80
60
40
20
0
SGA
Normal
SGA
16
94
103
Side by side charts
©[email protected] 2013
Ogive
0
20
40
60
80
100
120
10 20 30 40 50 60
Tabulating and Graphing Numerical
Data
0
1
2
3
4
5
6
7
10 20 30 40 50 60
Numerical Data
Ordered Array
Stem and LeafDisplay
Histograms Area
Tables
2 144677
3 028
4 1
41, 24, 32, 26, 27, 27, 30, 24, 38, 21
21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Frequency DistributionsCumulative Distributions
Polygons
©[email protected] 2013
Tabulating Numerical Data: Frequency
Distributions Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 15)
Compute class interval (width): 10 (46/5 then round up)
Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95
Count observations & assign to classes
©[email protected] 2013
Frequency Distributions and Percentage Distributions
Data in ordered array:12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class Midpoint Freq %
10.0 - 19.9 14.95 3 15%
20.0 - 29.9 24.95 6 30%
30.0 - 39.9 34.95 5 25%
40.0 - 49.9 44.95 4 20%
50.0 - 59.9 54.95 2 10%
TOTAL 20 100%
©[email protected] 2013
3
6
5
4
2
0
1
2
3
4
5
6
7
14.95 24.95 34.95 44.95 54.95
Age
Fre
qu
en
cy
Graphing Numerical Data:
The HistogramData in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class MidpointsClass Boundaries
No Gaps Between
Bars
©[email protected] 2013
Graphing Numerical Data:
The Frequency Polygon
Class Midpoints
Data in ordered array:12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
0
1
2
3
4
5
6
7
14.95 24.95 34.95 44.95 54.95
©[email protected] 2013
Linear Regression Line
©[email protected] 2013
Survival Function
DURATION
76543210
Cu
m S
urv
iva
l
1.2
1.0
.8
.6
.4
.2
0.0
Survival Function
Censored
©[email protected] 2013
Principles of Graphical Excellence
Presents data in a way that provides substance, statistics and design
Communicates complex ideas with clarity, precision and efficiency
Gives the largest number of ideas in the most efficient manner
Almost always involves several dimensions Tells the truth about the data