Data analysis: 1. Describing data
Contact: Ana Jerončić, PhD, Department for Research in Biomedicine and Health. E-mail: [email protected]. Location: main building, 5th floor, room 512. Phone: 557-862.
![Page 1: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/1.jpg)
Data analysis: 1. Describing data
Ana Jerončić, PhD
Department for Research in Biomedicine and Health
![Page 3: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/3.jpg)
Contents of the 2nd week
1. Describing data - Central tendency and variability
2. Estimation - Accuracy, precision, standard error, confidence intervals
3. Hypothesis testing - Test statistics, P-value, choice of a statistical test
4. Interpretation of data - Causality and association, odds ratio, risk, correlation, linear regression
5. Sources of error - Type 1 and type 2 errors, power, bias, confounding
![Page 4: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/4.jpg)
Learning outcome
Critical appraisal of scientific papers
NOT! Implementation of data analysis
![Page 5: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/5.jpg)
Why?
To identify the best available treatment
To prevent “medical zombies”
To perform your own research
![Page 6: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/6.jpg)
Data analysis: Describing data
1. How the data should be organized prior to data analysis
2. Data types
3. Graphical & tabular techniques for description, summary statistics: Qualitative Data, Quantitative Data
![Page 7: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/7.jpg)
Height measurements among 1st year medical students
157 204 184 186 197 155 169
150 193 205 150 161 169 147
167 159 187 173 146 179 201
159 147 144 204 184 192 165
146 169 198 164 182 165 173
147 166 167 180 169 174 201
146 151 203 171 186 179 152
189 204 189 200 202 147 181
145 161 173 155 203 190 164
141 163 179 195 155 197 151
197 141 146 202 149 197 203
172 143 151 200 197 192 192
160 173 187 172 177 179 188
![Page 8: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/8.jpg)
What is the unit of measurement?
How many observations per subject?
![Page 9: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/9.jpg)
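Variables and their values, Units of observation

| Entity | Height (cm) | Weight (kg) | Age (years) | Sex (category) |
|---|---|---|---|---|
| Person 1 | 176 | 70 | 33 | Male |
| Person 2 | 171 | 60 | 38 | Female |
| Person 3 | 182 | 75 | 62 | Male |
| … | … | … | … | … |

Rows: OBSERVATIONS. Columns: VARIABLES. Each cell: Measurement/Observation.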
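As an illustration only, the layout above could be held in code as one record per unit of observation, with one field per variable. This is a minimal Python sketch; the field names (`entity`, `height_cm`, and so on) are assumptions made for the example and do not come from the slides.

```python
# One record (row) per unit of observation; one field (column) per variable.
# Field names are hypothetical; the values are the three persons on the slide.
people = [
    {"entity": "Person 1", "height_cm": 176, "weight_kg": 70, "age_years": 33, "sex": "Male"},
    {"entity": "Person 2", "height_cm": 171, "weight_kg": 60, "age_years": 38, "sex": "Female"},
    {"entity": "Person 3", "height_cm": 182, "weight_kg": 75, "age_years": 62, "sex": "Male"},
]

# A variable is a column: the same field collected across all observations.
heights_cm = [person["height_cm"] for person in people]

# An observation is a row: all variables measured on one entity.
first_observation = people[0]

print(heights_cm)   # the Height variable
print(len(people))  # number of observations, one per subject
```

Keeping one row per subject makes the two questions from the previous slide easy to answer: the unit of observation is the person, and each person contributes exactly one observation.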
![Page 10: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/10.jpg)
Types of data (variables)

| Variable | Features of variables | Example | Descriptive statistics | Informativeness level |
|---|---|---|---|---|
| Categorical, Nominal | Unordered/unarranged categories | Gender, urbanization | Number, proportion | Low |
| Ordinal | Ordered/arranged categories | Grades, scales | Median | Medium |
| Numerical | Arranged categories with equal intervals | Height, weight | Mean or median | High |
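The pairing of data type and descriptive statistic in the table can be tried out with Python's standard library. The sketch below is illustrative only: the `grades` list is made-up data, while the sex and height values are the three persons from the earlier table.

```python
from collections import Counter
from statistics import mean, median

sex = ["Male", "Female", "Male"]   # nominal: only counting is meaningful
grades = [2, 3, 3, 4, 5]           # ordinal: hypothetical grades, order is meaningful
heights_cm = [176, 171, 182]       # numerical: distances are meaningful

# Nominal -> number and proportion per category
counts = Counter(sex)
proportions = {category: n / len(sex) for category, n in counts.items()}

# Ordinal -> median
median_grade = median(grades)

# Numerical -> mean or median
mean_height = mean(heights_cm)
median_height = median(heights_cm)

print(counts, proportions)
print(median_grade, mean_height, median_height)
```

Nothing in the code stops a mean from being computed on a nominal variable coded as numbers; choosing the appropriate statistic is up to the analyst.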
![Page 11: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/11.jpg)
Types of data
Qualitative: Categorical (Nominal), Ordinal
Quantitative: Numerical
![Page 12: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/12.jpg)
Type of data?
Height, Grades, Age in years, Weight, Insulin concentration, Blood glucose
![Page 13: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/13.jpg)
Type of data?
How many cigarettes do you smoke a day?
1-5 6-10 11-15 16-20 21 and more
![Page 14: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/14.jpg)
Type of data?
Have you ever had a heart attack? Yes No
Do you suffer from hypertension? Yes No ?
![Page 15: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/15.jpg)
Type of data?
Gender: Male Female
![Page 16: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/16.jpg)
Type of data?
Marital status: married divorced widowed single lives alone ?
![Page 17: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/17.jpg)
Type of data?
Education: elementary school high school two-year college four-year college ?
![Page 18: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/18.jpg)
Type of data?
Likert scale
Claim: Violence among the youth is becoming an increasing problem in Croatia.
1 = I agree completely
2 = I agree
3 = Undecided
4 = I disagree
5 = I strongly disagree
![Page 19: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/19.jpg)
Type of data?
Visual analogue scale
E.g., the pain level that the examinee experiences
Left end of the scale: I don’t feel pain
Right end of the scale: I feel intolerable pain
![Page 20: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/20.jpg)
Variables – Transformation of variables
Numerical: Distance is meaningful
Ordinal: Attributes can be ordered
Nominal: Attributes are only named; weakest
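One reading of this hierarchy is that a variable can be transformed downward (numerical to ordinal, ordinal to nominal) but not back, because information is lost at each step. The Python sketch below illustrates the first step, using the cigarette categories from the earlier questionnaire slide; the function name and the sample answers are assumptions made for the example.

```python
def cigarettes_to_category(cigarettes_per_day):
    """Map a numerical count to one of the ordered questionnaire categories."""
    if cigarettes_per_day < 1:
        raise ValueError("the questionnaire categories start at 1 cigarette a day")
    if cigarettes_per_day <= 5:
        return "1-5"
    if cigarettes_per_day <= 10:
        return "6-10"
    if cigarettes_per_day <= 15:
        return "11-15"
    if cigarettes_per_day <= 20:
        return "16-20"
    return "21 and more"

# Hypothetical answers from five respondents (numerical data).
answers = [3, 12, 20, 21, 40]

# After the transformation the data are ordinal: 21 and 40 are no longer
# distinguishable, so the original counts cannot be recovered.
categories = [cigarettes_to_category(a) for a in answers]
print(categories)
```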
![Page 21: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/21.jpg)
Watch out for… inconsistency in the literature in how data types are classified
![Page 22: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/22.jpg)
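| Person No. | Height [cm] |
|---|---|
| Person 1 | 148 |
| Person 2 | 142 |
| Person 3 | 154 |
| Person 4 | 153 |
| Person 5 | 160 |
| Person 6 | 177 |
| Person 7 | 204 |
| Person 8 | 192 |
| Person 9 | 191 |
| Person 10 | 203 |
| Person 11 | 197 |
| Person 12 | 202 |
| Person 13 | 177 |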
Observe and Describe
![Page 23: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/23.jpg)
Organized data are input for Graphical & Tabular data representations

| Person No. | Height [cm] |
|---|---|
| Person 1 | 148 |
| Person 2 | 142 |
| Person 3 | 154 |
| Person 4 | 153 |
| Person 5 | 160 |
| Person 6 | 177 |
| Person 7 | 204 |
| Person 8 | 192 |
| Person 9 | 191 |
| Person 10 | 203 |
| Person 11 | 197 |
| Person 12 | 202 |
| Person 13 | 177 |
Observe and Describe
![Page 24: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/24.jpg)
QUALITATIVE DATA
![Page 25: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/25.jpg)
Tabular Techniques for Qualitative Variable(s) – YPEL5 example – Contingency Table

In one study, researchers investigated the genotype of the YPEL5 gene in a population sample from Split. They got the following results on 10 examinees:

| Individual | YPEL5 Genotype |
|---|---|
| 1 | AA |
| 2 | BB |
| 3 | BB |
| 4 | BB |
| 5 | AB |
| 6 | AB |
| 7 | BB |
| 8 | AA |
| 9 | AB |
| 10 | BB |

Table: Frequency Distribution of YPEL5 genotypes

| Genotype | Frequency | Relative Frequency (proportion) | Relative Frequency [%] (percentage) |
|---|---|---|---|
| AA | 2 | 0.2 | 20% |
| AB | 3 | 0.3 | 30% |
| BB | 5 | 0.5 | 50% |
| Total | 10 | 1.00 | 100% |
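The frequency distribution above can be reproduced by counting. The following is a minimal Python sketch using the ten genotypes from the slide; the function name `frequency_table` is an assumption made for the example.

```python
from collections import Counter

# YPEL5 genotypes of the 10 examinees, in the order listed on the slide.
genotypes = ["AA", "BB", "BB", "BB", "AB", "AB", "BB", "AA", "AB", "BB"]

def frequency_table(values):
    """Return (category, frequency, relative frequency) rows, sorted by category."""
    counts = Counter(values)
    total = len(values)
    return [(cat, counts[cat], counts[cat] / total) for cat in sorted(counts)]

for category, freq, rel in frequency_table(genotypes):
    # Print the count, the proportion and the percentage side by side.
    print(f"{category}  {freq:>2}  {rel:.1f}  {rel:.0%}")
```

The rows come out as AA 2, AB 3 and BB 5, matching the table.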
![Page 26: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/26.jpg)
Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Bar Chart
Bar Charts are often used to display frequencies…
[Bar chart: x-axis, categories' names (YPEL5 genotype: BB, AA, AB); y-axis, counts or percentages (Frequency, 0 to 5)]
![Page 27: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/27.jpg)
Is there an association between the medicine taken and the length of cold?
[Table not recovered. Percentages shown for two groups: 84% and 16% (total 100%); 19% and 81% (total 100%)]
![Page 28: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/28.jpg)
Graphical & Tabular Techniques
For qualitative data, the only allowable calculation is counting the frequency of each category.
We can summarize the data in a contingency table that presents the categories and their counts; this summary is called a frequency distribution.
A relative frequency distribution lists the categories and the proportion with which each occurs.
![Page 29: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/29.jpg)
Graphical Techniques for Qualitative Variable(s) – Bar Chart – Pareto chart
Nominal data have no order. However, it is sometimes useful to arrange the outcomes from the most frequently occurring to the least frequently occurring. This bar chart representation is called a “Pareto chart”.
[Bar chart: x-axis, categories' names; y-axis, counts]
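The ordering step of a Pareto chart amounts to sorting the categories by descending frequency. As a small Python sketch, reusing the YPEL5 genotypes from the earlier example as illustrative data:

```python
from collections import Counter

genotypes = ["AA", "BB", "BB", "BB", "AB", "AB", "BB", "AA", "AB", "BB"]

# most_common() returns (category, count) pairs, most frequent first,
# which is the bar order a Pareto chart would use.
pareto_order = Counter(genotypes).most_common()
print(pareto_order)
```

The bars would then be drawn in the order BB, AB, AA.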
![Page 30: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/30.jpg)
Graphical Techniques for Qualitative Variable(s) – Bar Chart – Pareto chart
A chart with relative frequencies is more informative.
[Bar chart: x-axis, categories' names; y-axis, percentages]
![Page 31: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/31.jpg)
Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Pie Chart
Pie Charts show relative frequencies…
[Pie chart: BB 50%, AA 20%, AB 30%]
![Page 32: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/32.jpg)
Watch out for . . .
Authors can use percentages to hide the true size of the data. To say that 50% of a sample has a certain condition when there are only four people in the sample is clearly not providing the same level of information as 50% of a sample based on 400 people.
So, percentages should be used as an additional help for the reader rather than as a replacement for the actual data.
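One simple way to follow this advice is to report the count and the sample size next to every percentage. The helper below is a hypothetical Python sketch (the name `describe_share` is not from the slides):

```python
def describe_share(k, n):
    """Format k out of n as 'k/n (percent)' so the sample size stays visible."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return f"{k}/{n} ({k / n:.0%})"

print(describe_share(2, 4))      # small sample
print(describe_share(200, 400))  # same percentage, much larger sample
```

Both lines show 50%, but the reader can see at once that the first rests on four people and the second on 400.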
![Page 33: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/33.jpg)
A chart that changed medicine
![Page 34: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/34.jpg)
QUANTITATIVE DATA
![Page 35: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/35.jpg)
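Graphical Technique for Quantitative Data
Height measurements among 1st year medical students

| Individual | Height (cm) |
|---|---|
| 1 | 186 |
| 2 | 144 |
| 3 | 175 |
| 4 | 199 |
| 5 | 149 |
| 6 | 157 |
| 7 | 150 |
| 8 | 176 |
| 9 | 179 |
| 10 | 165 |
| 11 | 151 |
| 12 | 164 |
| 13 | 167 |
| 14 | 175 |
| 15 | 191 |
| 16 | 163 |
| 17 | 187 |
| 18 | 176 |
| 19 | 184 |
| 20 | 191 |
| 21 | 172 |
| 22 | 151 |
| 23 | 179 |

Frequency distribution for quantitative data: Building a Histogram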
![Page 36: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/36.jpg)
Building a Histogram…

Frequency distribution of height

| Category limits [cm] | Freq. | Relative Freq. | Percent Relative Freq. |
|---|---|---|---|
| >140; <=150 | 3 | 0.13 | 13% |
| >150; <=160 | 3 | 0.13 | 13% |
| >160; <=170 | 4 | 0.17 | 17% |
| >170; <=180 | 7 | 0.30 | 30% |
| >180; <=190 | 5 | 0.22 | 22% |
| >190; <=200 | 1 | 0.04 | 4% |
| Total | 23 | 1.00 | 100% |

[Histogram: x-axis, Height [cm], class midpoints 145 to 195; y-axis, Percent Relative Frequency, 0% to 35%]
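The class counts behind a histogram can be obtained by assigning each value to exactly one class. The Python sketch below uses the 23 heights listed on the previous slide and assumes that every class is open on the left and closed on the right, as the first row (>140; <=150) indicates; the function name is hypothetical. Note that with the values as listed, the two highest classes come out as 3 and 3 rather than the 5 and 1 shown in the table, while the first four classes agree; the slide's table may have been built from slightly different data.

```python
heights = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
           167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]

# Class limits in cm; each class is (lower, upper], i.e. >lower and <=upper.
edges = [140, 150, 160, 170, 180, 190, 200]

def class_counts(values, edges):
    """Count how many values fall into each right-closed class."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] < v <= edges[i + 1]:
                counts[i] += 1
                break
        else:
            raise ValueError(f"{v} falls outside the class limits")
    return counts

counts = class_counts(heights, edges)
relative = [c / len(heights) for c in counts]

for i, c in enumerate(counts):
    print(f">{edges[i]}; <={edges[i + 1]}  {c}  {relative[i]:.2f}  {relative[i]:.0%}")
```

Stating which end of each class is closed matters here: a height of exactly 150 cm must land in one class only.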
![Page 37: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/37.jpg)
Graphical Techniques for Quantitative Data
There are several graphical methods that are used when the data are quantitative (numeric).
The most important of these graphical methods is the histogram.
The histogram is not only a powerful graphical technique used to summarize interval data, but it is also used to help explain probabilities.
![Page 38: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/38.jpg)
Histogram simulations http://www.shodor.org/interactivate/activities/Histogram/
![Page 39: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/39.jpg)
![Page 40: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/40.jpg)
Summary: Graphs/Tables for Describing Data
Qualitative: Frequency Distribution – tabular summary of data; Bar Chart; Pie Chart
Quantitative: Frequency Distribution – tabular summary of data; Histogram; Line Chart (Time-Series Plot); Stem and Leaf Display
![Page 41: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/41.jpg)
Relationship between two variables
![Page 42: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/42.jpg)
Relationship between two variables
To compare two variables we use:
Scatter plot/diagram (quantitative)
Cross table (qualitative)
![Page 43: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/43.jpg)
Scatter plot – for two quantitative variables
Scatter plot, showing the strong association between enzyme activity at pH 5.5 and the 5α-reductase 2-specific mRNA expression, as expressed on the basis of β-actin (n = 30; rs = 0.81; 95% confidence interval, 0.64–0.91; P < 0.0001).
![Page 44: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/44.jpg)
Patterns of Scatter Diagrams…
Linearity and Direction are two concepts we are interested in.
[Three example scatter diagrams: Positive Linear Relationship; Negative Linear Relationship; Weak or Non-Linear Relationship]
![Page 45: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/45.jpg)
Scatter plot
Analysis of expression level from microarray data
Squamous cell carcinoma tumor and perilesional tissue display distinctly different scatter plots from normal tissue. Expression levels for gene subset 1 in patient 1.
![Page 46: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/46.jpg)
Cross Table - for two qualitative variables
Used to compare two qualitative variables
If the first variable has r categories and the second variable has c categories, then we have an r×c cross table.
![Page 47: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/47.jpg)
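Association of two qualitative variables

| YPEL5 Genotype | Disease X: YES | Disease X: NO | TOTAL |
|---|---|---|---|
| AA | 2 | 0 | 2 |
| AB | 1 | 3 | 4 |
| BB | 0 | 4 | 4 |
| TOTAL | 3 | 7 | 10 |

[Grouped bar chart: x-axis, YPEL5 genotype (AA, AB, BB); y-axis, Frequency (0 to 4); series: Disease, Healthy]

Based on the data presented, do you think that YPEL5 could be associated with disease X?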
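When answering this question it can help to look at row proportions, that is, the share of each genotype group that has the disease. The Python sketch below does this for the cell counts of the cross table; the variable names are assumptions made for the example.

```python
# Cell counts from the 3x2 cross table: genotype (rows) by Disease X (columns).
cross_table = {
    "AA": {"YES": 2, "NO": 0},
    "AB": {"YES": 1, "NO": 3},
    "BB": {"YES": 0, "NO": 4},
}

# Marginal totals, as in the TOTAL row and column of the table.
row_totals = {g: sum(row.values()) for g, row in cross_table.items()}
col_totals = {c: sum(row[c] for row in cross_table.values()) for c in ("YES", "NO")}

# Share of each genotype group with the disease (row proportions).
share_with_disease = {g: cross_table[g]["YES"] / row_totals[g] for g in cross_table}

for g, share in share_with_disease.items():
    print(f"{g}: {cross_table[g]['YES']}/{row_totals[g]} ({share:.0%})")
```

The shares differ markedly between genotypes, but they rest on groups of only 2 to 4 people, so the counts should be read alongside the percentages.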
![Page 49: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/49.jpg)
The projected clinical cost for breast cancer detection program in 2011-12, broken down by service category.
![Page 50: Data analysis: 1. Describing data](https://reader038.vdocument.in/reader038/viewer/2022102800/56812c60550346895d90ee2b/html5/thumbnails/50.jpg)
The results of measuring the height among med. students

| Individual | Height (cm) |
|---|---|
| 1 | 186 |
| 2 | 144 |
| 3 | 175 |
| 4 | 199 |
| 5 | 149 |
| 6 | 157 |
| 7 | 150 |
| 8 | 176 |
| 9 | 179 |
| 10 | 165 |
| 11 | 151 |
| 12 | 164 |
| 13 | 167 |
| 14 | 175 |
| 15 | 191 |
| 16 | 163 |
| 17 | 187 |
| 18 | 176 |
| 19 | 184 |
| 20 | 191 |
| 21 | 172 |
| 22 | 151 |
| 23 | 179 |

Histogram – common mistake!
[Bar chart: x-axis, subjects (1 to 23); y-axis, Height [cm] (0 to 250); one bar per subject]

Same data – with reshuffled subjects
[Bar chart: the same bars with the subjects in a different order; x-axis, subjects; y-axis, Height [cm]]
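The mistake here is presumably that a chart with one bar per subject is not a histogram: its shape depends on the arbitrary order of the subjects, whereas a frequency distribution does not. The Python sketch below checks this on the 23 heights, assuming 10 cm classes that are closed on the right; the function name and the fixed random seed are choices made for the example.

```python
import random
from collections import Counter

heights = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
           167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]

def class_counts(values, width=10):
    """Count values per right-closed class, keyed by the class upper limit."""
    # For integer heights, (v + width - 1) // width * width is the upper limit
    # of the class containing v, e.g. 150 -> 150 and 151 -> 160.
    return Counter((v + width - 1) // width * width for v in values)

# Reorder the subjects (reproducibly). A one-bar-per-subject chart would now
# look different, because it simply plots the list in its current order.
shuffled = heights[:]
random.Random(0).shuffle(shuffled)

# The frequency distribution, and therefore the histogram, is unchanged.
same_distribution = class_counts(shuffled) == class_counts(heights)
print(same_distribution)
```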